ocups-kafka/job_crawler/config/config.yml.docker
李顺东 3acc0a9221 feat(job_crawler): implement reverse-order incremental crawling with real-time Kafka publishing
- Add comprehensive sequence diagrams documenting container startup, task initialization, and incremental crawling flow
- Implement reverse-order crawling logic (from latest to oldest) to optimize performance by processing new data first
- Add real-time Kafka message publishing after each batch filtering instead of waiting for task completion
- Update progress tracking to store last_start_offset for accurate incremental crawling across sessions
- Enhance crawler service with improved offset calculation and batch processing logic
- Update configuration files to support new crawling parameters and Kafka integration
- Add progress model enhancements to track crawling state and handle edge cases
- Improve main application initialization to properly handle lifespan events and task auto-start
This change enables efficient incremental data collection where new data is prioritized and published immediately, reducing latency and improving system responsiveness.
2026-01-15 17:46:55 +08:00
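
The reverse-order flow described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `reverse_batch_ranges`, `crawl_once`, `fetch`, `keep`, `publish`, and the `floor` parameter are all hypothetical names, and the sketch assumes that newer records sit at higher offsets. `floor` stands for the lowest offset that still needs crawling; how the real `last_start_offset` maps onto that boundary is not shown in the source.

```python
from typing import Callable, Iterator, List, Tuple


def reverse_batch_ranges(total: int, batch_size: int, floor: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) pairs from the newest records down to `floor`.

    Assumption: offsets grow with time, so the newest data is at the end.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    end = total
    while end > floor:
        start = max(floor, end - batch_size)
        yield start, end - start
        end = start


def crawl_once(
    total: int,
    batch_size: int,
    floor: int,
    fetch: Callable[[int, int], List[int]],
    keep: Callable[[int], bool],
    publish: Callable[[List[int]], None],
) -> int:
    """Fetch newest-first, filter each batch, and publish it immediately.

    `publish` is a stand-in for a Kafka producer call; it runs once per
    batch instead of once at the end of the task.
    """
    published = 0
    for offset, size in reverse_batch_ranges(total, batch_size, floor):
        fresh = [record for record in fetch(offset, size) if keep(record)]
        if fresh:
            publish(fresh)
            published += len(fresh)
    return published


if __name__ == "__main__":
    data = list(range(250))  # fake records; the value doubles as the offset
    sent: List[List[int]] = []
    count = crawl_once(
        total=len(data),
        batch_size=100,
        floor=0,
        fetch=lambda offset, size: data[offset:offset + size],
        keep=lambda record: record % 2 == 0,
        publish=sent.append,
    )
    print(count, [batch[0] for batch in sent])
```

Publishing inside the loop means a consumer sees the newest batch before older ones are even fetched, which is the latency benefit the commit message describes.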

42 lines
906 B


# Docker environment configuration file
# Copy this file to config.yml and set your username and password

# Application configuration
app:
  name: job-crawler
  version: 1.0.0
  debug: false

# Bazhuayu (八爪鱼) API configuration
api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100

  # Multi-task configuration
  tasks:
    - id: "00f3b445-d8ec-44e8-88b2-4b971a228b1e"
      name: "青岛招聘数据"  # "Qingdao recruitment data"
      enabled: true
    - id: "task-id-2"
      name: "任务2"  # "Task 2"
      enabled: false

# Kafka configuration (Docker internal network)
kafka:
  bootstrap_servers: kafka:29092
  topic: job_data
  consumer_group: job_consumer_group

# Crawler configuration
crawler:
  interval: 300
  filter_days: 7
  max_workers: 5
  max_expired_batches: 3  # threshold of consecutive expired batches (applies on the first crawl)
  auto_start: true        # start crawling automatically when the container starts

# Database configuration
database:
  path: /app/data/crawl_progress.db
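
One plausible reading of how `filter_days` and `max_expired_batches` in the crawler section interact is sketched below. It assumes that a batch counts as "expired" when none of its records fall within the last `filter_days` days, and that crawling stops once `max_expired_batches` such batches occur in a row. The helper names (`is_fresh`, `select_batches`) and the use of bare timestamps as records are illustrative assumptions, not taken from the repository.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

FILTER_DAYS = 7          # mirrors crawler.filter_days
MAX_EXPIRED_BATCHES = 3  # mirrors crawler.max_expired_batches


def is_fresh(published_at: datetime, now: datetime, filter_days: int = FILTER_DAYS) -> bool:
    """A record is fresh if it is at most `filter_days` days old."""
    return now - published_at <= timedelta(days=filter_days)


def select_batches(
    batches: Iterable[List[datetime]],
    now: datetime,
    filter_days: int = FILTER_DAYS,
    max_expired_batches: int = MAX_EXPIRED_BATCHES,
) -> List[List[datetime]]:
    """Keep fresh records batch by batch; stop after a run of expired batches.

    Assumption: a batch with no fresh records (including an empty batch)
    counts as expired, and any batch with fresh records resets the streak.
    """
    kept: List[List[datetime]] = []
    expired_streak = 0
    for batch in batches:
        fresh = [ts for ts in batch if is_fresh(ts, now, filter_days)]
        if fresh:
            expired_streak = 0
            kept.append(fresh)
        else:
            expired_streak += 1
            if expired_streak >= max_expired_batches:
                break  # older batches are unlikely to contain fresh data
    return kept


if __name__ == "__main__":
    now = datetime(2026, 1, 15)
    day = timedelta(days=1)
    batches = [
        [now - 1 * day, now - 10 * day],  # one fresh record
        [now - 8 * day],                  # expired (streak 1)
        [now - 2 * day],                  # fresh, streak resets
        [now - 9 * day],                  # expired (streak 1)
        [now - 20 * day],                 # expired (streak 2)
        [now - 30 * day],                 # expired (streak 3), stop here
        [now - 1 * day],                  # never reached
    ]
    print(len(select_batches(batches, now)))
```

Requiring several consecutive expired batches, rather than stopping at the first one, tolerates an occasional stale batch mixed in among recent data.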