Files

李顺东 3cacaf040a feat(job_crawler): enhance logging and tracking for data filtering and Kafka production

- Add logging when API returns empty data to track offset progression
- Track expired job count separately from valid filtered jobs
- Initialize produced counter to handle cases with no filtered jobs
- Consolidate logging into single comprehensive info log per batch
- Log includes: total fetched, valid, expired, and Kafka-produced counts
- Improves observability for debugging data flow and filtering efficiency

2026-01-15 17:59:12 +08:00
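
The per-batch counters described in the commit message above could be tracked roughly as follows. This is a minimal sketch, not the repository's code: `BatchStats`, `summarize_batch`, and the `is_expired` and `produce` callbacks are hypothetical names introduced for illustration.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("job_crawler")


@dataclass
class BatchStats:
    fetched: int = 0
    valid: int = 0
    expired: int = 0
    produced: int = 0  # starts at 0 so it is defined even when nothing passes the filter


def summarize_batch(task_id, offset, jobs, is_expired, produce):
    """Count fetched, valid, expired, and produced jobs for one batch.

    `is_expired(job)` and `produce(job)` are hypothetical callbacks;
    `produce` is assumed to return True when the Kafka send succeeds.
    """
    stats = BatchStats(fetched=len(jobs))
    if not jobs:
        # Log empty responses so offset progression stays visible.
        logger.info("task=%s offset=%d: API returned no data", task_id, offset)
        return stats
    for job in jobs:
        if is_expired(job):
            stats.expired += 1
            continue
        stats.valid += 1
        if produce(job):
            stats.produced += 1
    # One consolidated info log per batch.
    logger.info(
        "task=%s offset=%d fetched=%d valid=%d expired=%d produced=%d",
        task_id, offset, stats.fetched, stats.valid, stats.expired, stats.produced,
    )
    return stats
```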

app

feat(job_crawler): enhance logging and tracking for data filtering and Kafka production

2026-01-15 17:59:12 +08:00

config

feat(job_crawler): implement reverse-order incremental crawling with real-time Kafka publishing

2026-01-15 17:46:55 +08:00

.dockerignore

feat(job_crawler): initialize job crawler service with kafka integration

2026-01-15 17:09:43 +08:00

deploy.bat

docs(job_crawler): add deployment guide and scripts for Linux/Mac/Windows

2026-01-15 17:12:51 +08:00

DEPLOY.md

docs(job_crawler): add deployment guide and scripts for Linux/Mac/Windows

2026-01-15 17:12:51 +08:00

deploy.sh

docs(job_crawler): add deployment guide and scripts for Linux/Mac/Windows

2026-01-15 17:12:51 +08:00

docker-compose.yml

feat(job_crawler): initialize job crawler service with kafka integration

2026-01-15 17:09:43 +08:00

Dockerfile

feat(job_crawler): initialize job crawler service with kafka integration

2026-01-15 17:09:43 +08:00

README.md

feat(job_crawler): initialize job crawler service with kafka integration

2026-01-15 17:09:43 +08:00

requirements.txt

feat(job_crawler): initialize job crawler service with kafka integration

2026-01-15 17:09:43 +08:00

README.md

Incremental Job Posting Collection Service

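Collects job postings from the Bazhuayu API, keeps only those published within the last 7 days, and exposes them to external consumers as a message queue through the bundled Kafka service.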
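
The 7-day filter could look roughly like the sketch below. It assumes each record carries an ISO 8601 timestamp under a `publish_date` key; that field name and the `split_recent` helper are hypothetical and not taken from the service's code.

```python
from datetime import datetime, timedelta, timezone


def split_recent(jobs, now, filter_days=7, date_key="publish_date"):
    """Split jobs into (recent, expired) relative to `now`.

    `now` must be timezone-aware. `date_key` is an assumed field name.
    Timestamps without an offset are treated as UTC.
    """
    cutoff = now - timedelta(days=filter_days)
    recent, expired = [], []
    for job in jobs:
        published = datetime.fromisoformat(job[date_key])
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        (recent if published >= cutoff else expired).append(job)
    return recent, expired
```

The default of 7 mirrors the `filter_days` setting in `config/config.yml`.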

Project Structure

job_crawler/
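├── app/                        # Application code
│   ├── api/                    # API routes
│   ├── core/                   # Core configuration
│   ├── models/                 # Data models
│   ├── services/               # Business services
│   ├── utils/                  # Utility functions
│   └── main.py
├── config/                     # Configuration directory (mounted)
│   ├── config.yml              # Configuration file
│   └── config.yml.docker       # Docker configuration template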
├── docker-compose.yml
├── Dockerfile
└── requirements.txt

Quick Start

1. Configuration

cd job_crawler

# Copy the configuration template
cp config/config.yml.docker config/config.yml

# Edit the configuration file and fill in your username and password
vim config/config.yml

2. Start the services

# Start all services
docker-compose up -d

# Follow the logs
docker-compose logs -f app

3. Build the image separately

# Build the image
docker build -t job-crawler:latest .

# Run (with the configuration directory mounted)
docker run -d \
  --name job-crawler \
  -p 8000:8000 \
  -v $(pwd)/config:/app/config:ro \
  -v job_data:/app/data \
  job-crawler:latest

Configuration File Reference

config/config.yml:

app:
  name: job-crawler
  debug: false

api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "task-id-1"
      name: "Qingdao job postings"
      enabled: true
    - id: "task-id-2"
      name: "Shanghai job postings"
      enabled: true
    - id: "task-id-3"
      name: "Beijing job postings"
      enabled: false  # Disabled

kafka:
  bootstrap_servers: kafka:29092
  topic: job_data

crawler:
  filter_days: 7
  max_workers: 5  # Maximum number of parallel tasks

database:
  path: /app/data/crawl_progress.db
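
As an illustration of how this file might be read, the sketch below loads it and selects the tasks marked `enabled: true`. It assumes PyYAML is installed (the contents of `requirements.txt` are not shown here), and `load_config` and `enabled_tasks` are hypothetical helpers rather than functions from the service.

```python
def load_config(path="config/config.yml"):
    """Parse the YAML configuration file into a dict."""
    import yaml  # PyYAML, a third-party package; assumed to be available

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def enabled_tasks(config):
    """Return the task entries under api.tasks whose `enabled` flag is true."""
    tasks = config.get("api", {}).get("tasks", [])
    return [task for task in tasks if task.get("enabled", False)]
```

With the sample configuration above, `enabled_tasks(load_config())` would return the entries for `task-id-1` and `task-id-2` and skip `task-id-3`.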

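API Endpoints

Endpoint	Method	Description
`/tasks`	GET	List all tasks
`/status`	GET	Get crawl status (accepts a task_id parameter)
`/crawl/start`	POST	Start crawling (accepts a task_id parameter)
`/crawl/stop`	POST	Stop crawling (accepts a task_id parameter)
`/consume`	GET	Consume data
`/health`	GET	Health check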

Usage Examples

# List all tasks
curl http://localhost:8000/tasks

# Get the status of all tasks
curl http://localhost:8000/status

# Get the status of a single task
curl "http://localhost:8000/status?task_id=xxx"

# Start all tasks
curl -X POST http://localhost:8000/crawl/start

# Start a single task
curl -X POST "http://localhost:8000/crawl/start?task_id=xxx"

# Stop all tasks
curl -X POST http://localhost:8000/crawl/stop

# Consume data
curl "http://localhost:8000/consume?batch_size=10"
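
The same calls can be made from Python using only the standard library. This is a sketch under two assumptions: the service listens on `http://localhost:8000` as in the commands above, and the endpoints return JSON (the response format is not documented here). `build_url` and `call` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"


def build_url(path, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def call(method, path, **params):
    """Send a request and decode the body, assuming a JSON response."""
    request = Request(build_url(path, **params), method=method)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `call("POST", "/crawl/start", task_id="xxx")` mirrors the single-task start command, and `call("GET", "/consume", batch_size=10)` mirrors the consume command.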
