feat(job_crawler): implement reverse-order incremental crawling with real-time Kafka publishing

- Add comprehensive sequence diagrams documenting container startup, task initialization, and incremental crawling flow
- Implement reverse-order crawling logic (from latest to oldest) to optimize performance by processing new data first
- Add real-time Kafka message publishing after each batch filtering instead of waiting for task completion
- Update progress tracking to store last_start_offset for accurate incremental crawling across sessions
- Enhance crawler service with improved offset calculation and batch processing logic
- Update configuration files to support new crawling parameters and Kafka integration
- Add progress model enhancements to track crawling state and handle edge cases
- Improve main application initialization to properly handle lifespan events and task auto-start

This change enables efficient incremental data collection where new data is prioritized and published immediately, reducing latency and improving system responsiveness.
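
For reviewers, the reverse-order flow described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this commit: `Progress`, `reverse_batches`, `crawl_once`, and the `fetch`/`publish`/`is_fresh` callbacks are hypothetical names, and it assumes that higher offsets are newer and that `last_start_offset` records the newest offset at which the previous session started. `publish` stands in for whatever function wraps the Kafka producer.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple


@dataclass
class Progress:
    # Hypothetical progress record; only the field name comes from the commit.
    last_start_offset: Optional[int] = None


def reverse_batches(total: int, batch_size: int, stop_at: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) offset windows from the newest records down to stop_at."""
    end = total
    while end > stop_at:
        start = max(stop_at, end - batch_size)
        yield start, end
        end = start


def crawl_once(
    fetch: Callable[[int, int], List],
    publish: Callable[[List], None],
    is_fresh: Callable[[object], bool],
    total: int,
    progress: Progress,
    batch_size: int = 100,
    max_expired_batches: int = 3,
) -> int:
    """Crawl newest-first, publishing each filtered batch immediately."""
    first_run = progress.last_start_offset is None
    # Assumption: later sessions only need offsets above the previous start.
    stop_at = progress.last_start_offset or 0
    expired_streak = 0
    published = 0
    for start, end in reverse_batches(total, batch_size, stop_at):
        fresh = [record for record in fetch(start, end) if is_fresh(record)]
        if fresh:
            publish(fresh)  # publish right after filtering, not at task end
            published += len(fresh)
            expired_streak = 0
        else:
            expired_streak += 1
            # On the first crawl, stop after several fully expired batches.
            if first_run and expired_streak >= max_expired_batches:
                break
    progress.last_start_offset = total
    return published
```

Under these assumptions, a first session walks back from the newest offset until it hits `max_expired_batches` consecutive batches with no fresh records, and a later session only covers the offsets added since the stored `last_start_offset`.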
2026-01-15 17:46:55 +08:00
parent 63cd432a0c
commit 3acc0a9221
8 changed files with 402 additions and 60 deletions
--- a/job_crawler/config/config.yml
+++ b/job_crawler/config/config.yml
@@ -26,7 +26,7 @@ api:

 # Kafka configuration
 kafka:
-  bootstrap_servers: localhost:9092
+  bootstrap_servers: kafka:29092
  topic: job_data
  consumer_group: job_consumer_group

@@ -35,6 +35,8 @@ crawler:
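   interval: 300          # crawl interval (seconds)
   filter_days: 7         # filter window (days)
   max_workers: 5         # maximum number of parallel tasks
+  max_expired_batches: 3 # threshold of consecutive expired batches (applies on the first crawl)
+  auto_start: true       # start crawling automatically when the container starts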

 # Database configuration
 database: