# Job Data Incremental Crawler Service

Collects job listing data from the 八爪鱼 (Octoparse) API, keeps only records published within the last 7 days, and exposes them through a built-in Kafka message queue for external consumers.
## Project Structure

```
job_crawler/
├── app/                     # Application code
│   ├── api/                 # API routes
│   ├── core/                # Core configuration
│   ├── models/              # Data models
│   ├── services/            # Business services
│   ├── utils/               # Utility helpers
│   └── main.py
├── config/                  # Config directory (mounted into the container)
│   ├── config.yml           # Config file
│   └── config.yml.docker    # Docker config template
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
```
## Quick Start

### 1. Configure

```bash
cd job_crawler

# Copy the config template
cp config/config.yml.docker config/config.yml

# Edit the config file and fill in your credentials
vim config/config.yml
```
### 2. Start the Services

```bash
# Start all services
docker-compose up -d

# Tail the application logs
docker-compose logs -f app
```
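Before starting any crawl tasks, it can help to wait until the service is actually up. A minimal sketch in Python, assuming the `/health` endpoint (documented below) returns HTTP 200 once the app is ready; the `wait_for_health` helper and its retry parameters are illustrative, not part of the project:

```python
import time

import requests  # third-party: pip install requests


def wait_for_health(base_url: str = "http://localhost:8000", timeout: int = 60) -> bool:
    """Poll /health until the service responds with 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # container may still be starting up
        time.sleep(2)
    return False


if __name__ == "__main__":
    print("healthy" if wait_for_health() else "service did not become healthy in time")
```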
### 3. Build the Image Standalone

```bash
# Build the image
docker build -t job-crawler:latest .

# Run it (mounting the config directory)
docker run -d \
  --name job-crawler \
  -p 8000:8000 \
  -v $(pwd)/config:/app/config:ro \
  -v job_data:/app/data \
  job-crawler:latest
```
## Configuration Reference

`config/config.yml`:

```yaml
app:
  name: job-crawler
  debug: false

api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "task-id-1"
      name: "Qingdao job data"
      enabled: true
    - id: "task-id-2"
      name: "Shanghai job data"
      enabled: true
    - id: "task-id-3"
      name: "Beijing job data"
      enabled: false  # disabled

kafka:
  bootstrap_servers: kafka:29092
  topic: job_data

crawler:
  filter_days: 7
  max_workers: 5  # maximum number of parallel tasks

database:
  path: /app/data/crawl_progress.db
```
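For illustration, this is how a service like this might interpret the `tasks` list and `filter_days`: disabled tasks are skipped, and only records published after a rolling cutoff are kept. A minimal sketch assuming PyYAML; the helper functions below are hypothetical, not the project's actual code:

```python
from datetime import datetime, timedelta

import yaml  # third-party: pip install pyyaml


def load_enabled_tasks(path: str = "config/config.yml") -> list[dict]:
    """Return only the entries in api.tasks with enabled: true."""
    with open(path, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    return [t for t in cfg["api"]["tasks"] if t.get("enabled")]


def publish_cutoff(path: str = "config/config.yml") -> datetime:
    """Oldest publish date to keep; filter_days: 7 keeps the last week."""
    with open(path, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    return datetime.now() - timedelta(days=cfg["crawler"]["filter_days"])


if __name__ == "__main__":
    for task in load_enabled_tasks():
        print(task["id"], task["name"])
    print("keep records published after:", publish_cutoff())
```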
## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/tasks` | GET | List all tasks |
| `/status` | GET | Get crawl status (optional `task_id` parameter) |
| `/crawl/start` | POST | Start crawling (optional `task_id` parameter) |
| `/crawl/stop` | POST | Stop crawling (optional `task_id` parameter) |
| `/consume` | GET | Consume collected data |
| `/health` | GET | Health check |
### Usage Examples

```bash
# List all tasks
curl http://localhost:8000/tasks

# Get the status of all tasks
curl http://localhost:8000/status

# Get the status of a single task
curl "http://localhost:8000/status?task_id=xxx"

# Start all tasks
curl -X POST http://localhost:8000/crawl/start

# Start a single task
curl -X POST "http://localhost:8000/crawl/start?task_id=xxx"

# Stop all tasks
curl -X POST http://localhost:8000/crawl/stop

# Consume data
curl "http://localhost:8000/consume?batch_size=10"
```
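Besides the HTTP `/consume` endpoint, the data can also be read straight from the Kafka topic. A minimal consumer sketch using kafka-python; the bootstrap address `localhost:9092` assumes docker-compose exposes an external listener on that port (the in-cluster address in `config.yml` is `kafka:29092`), and the consumer group name is illustrative:

```python
import json

from kafka import KafkaConsumer  # third-party: pip install kafka-python

# Assumed external listener; inside the compose network the broker is kafka:29092.
consumer = KafkaConsumer(
    "job_data",
    bootstrap_servers="localhost:9092",
    group_id="job-data-readers",       # illustrative consumer group
    auto_offset_reset="earliest",      # start from the oldest unread message
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each message is assumed to be one job posting record serialized as JSON.
    print(message.offset, message.value)
```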