# Incremental Job Data Collection Service
Collects job posting data from the Bazhuayu (Octoparse) API, filters for records published within the last 7 days, and exposes them through a built-in Kafka message queue for external consumers.
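
The 7-day window is the key filtering step. A minimal sketch of that filter is shown below (not the actual implementation): the `publish_time` field name and the `YYYY-MM-DD HH:MM:SS` format are assumptions about the Bazhuayu API payload.

```python
from datetime import datetime, timedelta

# Hypothetical field name and date format; the real Bazhuayu API payload may differ.
PUBLISH_FIELD = "publish_time"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def filter_recent(records: list[dict], days: int = 7) -> list[dict]:
    """Keep only records published within the last `days` days."""
    cutoff = datetime.now() - timedelta(days=days)
    recent = []
    for record in records:
        raw = record.get(PUBLISH_FIELD)
        if not raw:
            continue
        try:
            published = datetime.strptime(raw, DATE_FORMAT)
        except ValueError:
            continue  # skip records with an unexpected date format
        if published >= cutoff:
            recent.append(record)
    return recent
```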
## Project Structure
```
job_crawler/
├── app/                          # Application code
│   ├── api/                      # API routes
│   ├── core/                     # Core configuration
│   ├── models/                   # Data models
│   ├── services/                 # Business services
│   ├── utils/                    # Utility functions
│   └── main.py
├── config/                       # Configuration directory (mounted)
│   ├── config.yml                # Config file
│   └── config.yml.docker         # Docker config template
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
```
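
For orientation, `app/main.py` is the FastAPI entry point that wires the routers together. The sketch below is a self-contained illustration of that wiring; the router layout and names are assumptions, not the project's actual code.

```python
from fastapi import APIRouter, FastAPI

# In the real project the routers live under app/api/ and the crawler,
# Kafka, and progress services under app/services/; this sketch inlines
# a single router to stay self-contained.
router = APIRouter()

@router.get("/health")
def health() -> dict:
    """Liveness probe, matching the /health endpoint listed below."""
    return {"status": "ok"}

app = FastAPI(title="job-crawler")
app.include_router(router)
```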
## Quick Start
### 1. Configuration
```bash
cd job_crawler
# Copy the config template
cp config/config.yml.docker config/config.yml
# Edit the config file and fill in the account credentials
vim config/config.yml
```
### 2. Start the Services
```bash
# Start all services
docker-compose up -d
# Follow the application logs
docker-compose logs -f app
```
### 3. Build the Image Separately
```bash
# Build the image
docker build -t job-crawler:latest .
# Run the container (mount the config directory)
docker run -d \
  --name job-crawler \
  -p 8000:8000 \
  -v $(pwd)/config:/app/config:ro \
  -v job_data:/app/data \
  job-crawler:latest
```
## Configuration Reference
`config/config.yml`:
```yaml
app:
  name: job-crawler
  debug: false

api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100

# Multi-task configuration
tasks:
  - id: "task-id-1"
    name: "Qingdao job data"
    enabled: true
  - id: "task-id-2"
    name: "Shanghai job data"
    enabled: true
  - id: "task-id-3"
    name: "Beijing job data"
    enabled: false  # disabled

kafka:
  bootstrap_servers: kafka:29092
  topic: job_data

crawler:
  filter_days: 7
  max_workers: 5  # maximum number of parallel tasks

database:
  path: /app/data/crawl_progress.db
```
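
As a rough sketch of how these settings could be consumed, the snippet below loads the YAML with PyYAML and creates a `kafka-python` producer for the configured topic. The choice of config loader and Kafka client here is an assumption; the actual service may use different libraries.

```python
import json

import yaml
from kafka import KafkaProducer  # assumption: kafka-python; the service may use another client

# Path matches the config mount shown in the docker run example above.
with open("/app/config/config.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

producer = KafkaProducer(
    bootstrap_servers=config["kafka"]["bootstrap_servers"],
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)

def publish_job(record: dict) -> None:
    """Send one job record to the configured Kafka topic."""
    producer.send(config["kafka"]["topic"], value=record)

# Call producer.flush() before shutdown to drain buffered messages.
```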
## API Endpoints
| Endpoint | Method | Description |
|------|------|------|
| `/tasks` | GET | List all tasks |
| `/status` | GET | View collection status (supports a `task_id` parameter) |
| `/crawl/start` | POST | Start collection (supports a `task_id` parameter) |
| `/crawl/stop` | POST | Stop collection (supports a `task_id` parameter) |
| `/consume` | GET | Consume collected data |
| `/health` | GET | Health check |
### Usage Examples
```bash
# List all tasks
curl http://localhost:8000/tasks
# View the status of all tasks
curl http://localhost:8000/status
# View the status of a single task
curl "http://localhost:8000/status?task_id=xxx"
# Start all tasks
curl -X POST http://localhost:8000/crawl/start
# Start a single task
curl -X POST "http://localhost:8000/crawl/start?task_id=xxx"
# Stop all tasks
curl -X POST http://localhost:8000/crawl/stop
# Consume data
curl "http://localhost:8000/consume?batch_size=10"
```
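
For consumers that prefer HTTP over a native Kafka client, a small Python polling loop against `/consume` might look like the sketch below. The response shape (a JSON object with a `data` list) is an assumption and should be checked against the actual API.

```python
import time

import requests

BASE_URL = "http://localhost:8000"

def poll_jobs(batch_size: int = 10, interval: float = 5.0) -> None:
    """Repeatedly pull batches of job records from the /consume endpoint."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/consume",
            params={"batch_size": batch_size},
            timeout=30,
        )
        resp.raise_for_status()
        # Assumed schema: {"data": [...]}; adjust to the real response.
        records = resp.json().get("data", [])
        for record in records:
            print(record)
        if not records:
            time.sleep(interval)  # back off when the queue is empty

if __name__ == "__main__":
    poll_jobs()
```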