ocups-kafka/docs/技术方案.md

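# Technical Design: Incremental Job-Listing Collection and Message Queue Service

## 1. Project Overview

### 1.1 Background
The service collects job-listing data from the 八爪鱼 (Bazhuayu) API, keeps only listings published within the last 7 days, and exposes them to consumers through a RabbitMQ message queue whose messages expire automatically via per-message TTL.

### 1.2 Core Features
- Incremental collection of job-listing data from the Bazhuayu API (back to front, newest data first)
- Date filtering (both the publish date and the collection time must fall within the last 7 days)
- RabbitMQ message queue (message TTL; messages expire automatically after 7 days)
- Collection starts automatically when the container starts
- REST API for consuming the data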

---

## 2. System Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                      System Architecture                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Bazhuayu API │───▶│  Collector   │───▶│   Date filter    │  │
│  │ (data source)│    │(newest first)│    │  (last 7 days)   │  │
│  └──────────────┘    └──────────────┘    └────────┬─────────┘  │
│                                                   │            │
│                                                   ▼            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                     RabbitMQ service                     │  │
│  │  ┌─────────────────────────────────────────────────────┐ │  │
│  │  │  Queue: job_data                                    │ │  │
│  │  │  - Message TTL: 7 days (604800000 ms)               │ │  │
│  │  │  - Expired messages are removed automatically       │ │  │
│  │  │  - Durable storage                                  │ │  │
│  │  └─────────────────────────────────────────────────────┘ │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                   │            │
│                                                   ▼            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                     FastAPI service                      │  │
│  │  ┌─────────────────┐  ┌─────────────────────────────┐    │  │
│  │  │ GET /consume    │  │ GET /status                 │    │  │
│  │  │ (consume data)  │  │ (crawl status / progress)   │    │  │
│  │  └─────────────────┘  └─────────────────────────────┘    │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

---

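## 3. Technology Stack

| Component | Technology | Version | Notes |
|-----------|------------|---------|-------|
| Runtime | Python | 3.11+ | Primary development language |
| HTTP client | httpx | 0.27+ | Asynchronous HTTP requests |
| Message queue | RabbitMQ | 3.12+ | Supports per-message TTL |
| MQ client | pika | 1.3+ | Python RabbitMQ client library |
| API framework | FastAPI | 0.109+ | REST endpoints |
| Container orchestration | Docker Compose | 2.0+ | Service deployment |
| Data store | SQLite | built-in | Stores collection progress |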

---

## 4. Project Structure

```
job_crawler/
├── app/                        # Application code
│   ├── api/                    # API routing layer
│   │   ├── __init__.py
│   │   └── routes.py           # Route definitions
│   ├── core/                   # Core configuration
│   │   ├── __init__.py
│   │   ├── config.py           # Configuration management
│   │   └── logging.py          # Logging configuration
│   ├── models/                 # Data models
│   │   ├── __init__.py
│   │   ├── job.py              # Job-listing model
│   │   ├── progress.py         # Progress model
│   │   └── response.py         # Response model
│   ├── services/               # Business service layer
│   │   ├── __init__.py
│   │   ├── api_client.py       # Bazhuayu API client
│   │   ├── crawler.py          # Core collection logic
│   │   ├── rabbitmq_service.py # RabbitMQ service
│   │   └── progress_store.py   # Progress storage
│   ├── utils/                  # Utility functions
│   │   ├── __init__.py
│   │   └── date_parser.py      # Date parsing
│   ├── __init__.py
│   └── main.py                 # Application entry point
├── config/                     # Configuration files
│   ├── config.yml              # Runtime configuration
│   └── config.yml.docker       # Docker configuration template
├── docker-compose.yml          # Container orchestration
├── Dockerfile                  # Application image build
├── deploy.sh                   # Deployment script (Linux)
├── deploy.bat                  # Deployment script (Windows)
├── requirements.txt            # Python dependencies
└── README.md                   # Usage guide
```


---

## 5. Core Module Design

### 5.1 Incremental Collection Module

#### Collection strategy (back to front)
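```python
# Incremental collection flow
# 1. Fetch the total record count: total
# 2. Load the start position of the previous run: last_start_offset
# 3. Compute the range for this run:
#    - start_offset = total - batch_size  (begin at the newest data)
#    - end_offset = last_start_offset     (stop at the previous run's position)
# 4. Collect in a loop, decreasing offset from start_offset down to end_offset
# 5. Filter each batch and publish it to RabbitMQ immediately
# 6. When the run finishes, save last_start_offset = this run's start position
```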
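The steps above can be sketched as a small offset planner. This is a minimal sketch under stated assumptions, not the project's actual `crawler.py`: the name `plan_offsets` is hypothetical, offsets are assumed to be zero-based record positions, a first run is assumed to pass `last_start_offset=0`, and a final batch at `end_offset` is added whenever the backward walk would otherwise stop short of it. That last choice re-reads part of the previous run's newest batch, so consumers may see a few duplicates.

```python
from typing import List


def plan_offsets(total: int, last_start_offset: int, batch_size: int) -> List[int]:
    """Return the batch offsets for one run, newest batch first."""
    # Step 3: begin at the newest batch; clamp to 0 for very small data sets.
    start_offset = max(total - batch_size, 0)
    # Never plan past the start position when nothing new has arrived.
    end_offset = min(last_start_offset, start_offset)
    # Step 4: walk backwards one batch at a time, down to end_offset inclusive.
    offsets = list(range(start_offset, end_offset - 1, -batch_size))
    # Assumption: add end_offset itself so no record between batches is skipped.
    if offsets[-1] != end_offset:
        offsets.append(end_offset)
    return offsets


# Example: 250 new records arrived since the previous run started at 269900.
offsets = plan_offsets(total=270250, last_start_offset=269900, batch_size=100)
new_last_start_offset = offsets[0]  # Step 6: value to persist after the run
```

Under these assumptions the value saved in step 6 is simply the first offset of the plan.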

#### Progress persistence
Collection progress is stored in SQLite:
```sql
CREATE TABLE crawl_progress (
    task_id TEXT PRIMARY KEY,
    last_start_offset INTEGER,  -- start position of the previous run
    total INTEGER,
    last_update TIMESTAMP,
    status TEXT,
    filtered_count INTEGER,
    produced_count INTEGER
);
```

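### 5.2 Date Filtering Module

#### Parsing the `aae397` field

The example results assume the current date is 2026-01-15.

| Raw value | Parsing rule | Example result |
|-----------|--------------|----------------|
| "今天" (today) | Current date | 2026-01-15 |
| "1月13日" | Current year + month and day | 2026-01-13 |
| "1月9日" | Current year + month and day | 2026-01-09 |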
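The parsing rules in the table could be implemented roughly as follows. This is a sketch rather than the contents of `date_parser.py`: the optional `today` parameter, the `_safe_date` helper, and returning `None` for unrecognized values are assumptions made for illustration. The previous-year fallback is also an assumption that goes beyond the table; it covers a value such as "12月30日" seen in early January, which "current year + month and day" alone would place in the future.

```python
import re
from datetime import date
from typing import Optional

# Matches values such as "1月13日" (month 1, day 13).
_MONTH_DAY = re.compile(r"^(\d{1,2})月(\d{1,2})日$")


def _safe_date(year: int, month: int, day: int) -> Optional[date]:
    """Build a date, returning None for impossible values such as month 13."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def parse_aae397(raw: str, today: Optional[date] = None) -> Optional[date]:
    """Parse an aae397 value; return None when the format is not recognized."""
    today = today or date.today()
    text = raw.strip()
    if text == "今天":  # "today"
        return today
    match = _MONTH_DAY.match(text)
    if match is None:
        return None
    month, day = int(match.group(1)), int(match.group(2))
    parsed = _safe_date(today.year, month, day)
    # Assumption: a date later than today belongs to the previous year.
    if parsed is not None and parsed > today:
        parsed = _safe_date(today.year - 1, month, day)
    return parsed
```

Passing `today` explicitly keeps the function deterministic in tests; callers in production would normally omit it.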

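#### Filtering logic
```python
from datetime import datetime, timedelta

# parse_aae397 and parse_collect_time are the project's date parsers
# (assumed to live in app/utils/date_parser.py).


def is_within_days(aae397: str, collect_time: str, days: int = 7) -> bool:
    """
    Return True if the record falls within the last `days` days.
    Condition: the publish date AND the collection time are both within N days.
    """
    today = datetime.now().date()
    cutoff_date = today - timedelta(days=days)

    publish_date = parse_aae397(aae397)
    collect_date = parse_collect_time(collect_time)

    return publish_date >= cutoff_date and collect_date >= cutoff_date
```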

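### 5.3 RabbitMQ Service Module

#### Message TTL mechanism
```python
import pika

# Connection setup; host and queue name follow config/config.yml.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='rabbitmq'))
channel = connection.channel()

# Queue-level setting: x-message-ttl applies to every message in the queue
channel.queue_declare(
    queue='job_data',
    durable=True,
    arguments={
        'x-message-ttl': 604800000  # 7 days (milliseconds)
    }
)

# Per-message TTL set at publish time as well (double safeguard);
# when both are set, RabbitMQ applies the lower of the two values.
message = '{"_id": "uuid"}'  # placeholder payload; real messages follow section 6.2
channel.basic_publish(
    exchange='',
    routing_key='job_data',
    body=message,
    properties=pika.BasicProperties(
        delivery_mode=2,  # persistent
        expiration='604800000'  # 7 days, as a string of milliseconds
    )
)
```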

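#### Advantages
- Per-message TTL gives precise control over when each message expires
- Expired messages are removed automatically; no manual cleanup is needed
- The queue only ever holds data enqueued within the last 7 days (the TTL counts from the time a message is published to the queue)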
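The arithmetic behind `604800000`, and the way a queue-level TTL and a per-message TTL combine, can be shown in a few lines. This is a pure-Python illustration that does not talk to a broker; the helper name `effective_ttl_ms` is hypothetical. It relies on RabbitMQ's documented rule that the lower of the two values applies when both are set.

```python
from typing import Optional

# 7 days * 24 h * 60 min * 60 s * 1000 ms
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def effective_ttl_ms(queue_ttl_ms: Optional[int],
                     message_ttl_ms: Optional[int]) -> Optional[int]:
    """Return the TTL that would be enforced, or None when no TTL is set."""
    candidates = [ttl for ttl in (queue_ttl_ms, message_ttl_ms) if ttl is not None]
    return min(candidates) if candidates else None
```

With both values set to 7 days, as in this design, the effective TTL is 7 days either way; the second setting only matters if one of them is later changed.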

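### 5.4 REST API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/consume` | GET | Consume data from the queue; accepts a `batch_size` parameter |
| `/queue/size` | GET | Get the number of messages in the queue |
| `/status` | GET | View collection progress and status |
| `/tasks` | GET | List tasks |
| `/crawl/start` | POST | Manually trigger a collection task |
| `/crawl/stop` | POST | Stop a collection task |

#### Endpoint details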

**GET /consume**
```json
// Request
GET /consume?batch_size=10

// Response
{
  "code": 0,
  "data": [
    {
      "_id": "uuid",
      "_task_id": "00f3b445-...",
      "_crawl_time": "2026-01-15T10:30:00",
      "Std_class": "机动车司机/驾驶",
      "aca112": "保底1万+五险+港内A2驾驶员",
      "AAB004": "青岛唐盛物流有限公司",
      "acb241": "1-1.5万",
      "aab302": "青岛黄岛区",
      "aae397": "1月13日",
      "Collect_time": "2026-01-15",
      ...
    }
  ],
  "count": 10
}
```
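A consumer might wrap the `/consume` call with two small helpers: one that builds the request URL and one that unpacks the response envelope. This is a sketch only. The helper names are hypothetical, and treating `code == 0` as success is an assumption based on the sample response, not a documented contract.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode


def build_consume_url(base_url: str, batch_size: int = 10) -> str:
    """Build the GET /consume URL for the given batch size."""
    query = urlencode({"batch_size": batch_size})
    return f"{base_url.rstrip('/')}/consume?{query}"


def extract_records(body: str) -> List[Dict[str, Any]]:
    """Parse a /consume response body and return the list of records."""
    payload = json.loads(body)
    # Assumption: code 0 means success, as in the sample response.
    if payload.get("code") != 0:
        raise RuntimeError(f"consume failed with code {payload.get('code')}")
    return payload.get("data", [])
```

The URL can be fetched with any HTTP client, for example `httpx.get(build_consume_url("http://localhost:8000")).text`, and the text passed to `extract_records`.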

**GET /status**
```json
{
  "code": 0,
  "data": {
    "tasks": [
      {
        "task_id": "00f3b445-...",
        "task_name": "青岛招聘数据",
        "total": 270000,
        "last_start_offset": 269900,
        "status": "completed",
        "filtered_count": 15000,
        "produced_count": 15000,
        "is_running": false
      }
    ],
    "queue_size": 12345,
    "running_count": 0
  }
}
```

---

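## 6. Data Model

### 6.1 Preserving raw data
Collected records keep their original field names; only metadata fields are added:

| Field | Description |
|-------|-------------|
| _id | Unique identifier (UUID) |
| _task_id | Task ID |
| _crawl_time | Time the record was ingested |
| Other fields | All fields returned by the source API, unchanged |

### 6.2 RabbitMQ message format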
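Adding the three metadata fields to a raw record could look like the sketch below. The function name `add_metadata` and the optional `now` parameter are hypothetical; the timestamp format follows the `_crawl_time` values shown in the sample messages, and raw field names are assumed never to collide with the underscore-prefixed metadata keys.

```python
import uuid
from datetime import datetime
from typing import Any, Dict, Optional


def add_metadata(raw: Dict[str, Any], task_id: str,
                 now: Optional[datetime] = None) -> Dict[str, Any]:
    """Return a new record with _id, _task_id and _crawl_time added."""
    crawl_time = (now or datetime.now()).replace(microsecond=0).isoformat()
    return {
        "_id": str(uuid.uuid4()),
        "_task_id": task_id,
        "_crawl_time": crawl_time,
        # Raw API fields are copied unchanged after the metadata.
        **raw,
    }
```

The input dictionary is not modified, so the same raw record can safely be wrapped again on a retry.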
```json
{
  "_id": "uuid",
  "_task_id": "00f3b445-d8ec-44e8-88b2-4b971a228b1e",
  "_crawl_time": "2026-01-15T10:30:00",
  "Std_class": "机动车司机/驾驶",
  "aca112": "保底1万+五险+港内A2驾驶员",
  "AAB004": "青岛唐盛物流有限公司",
  "AAB019": "民营",
  "acb241": "1-1.5万",
  "aab302": "青岛黄岛区",
  "AAE006": "青岛市黄岛区...",
  "aae397": "1月13日",
  "Collect_time": "2026-01-15",
  "ACE760": "https://www.zhaopin.com/...",
  "acb22a": "岗位职责...",
  "Experience": "5-10年",
  "aac011": "学历不限",
  "acb240": "1人",
  "AAB022": "交通/运输/物流",
  "Num_employers": "20-99人",
  "AAE004": "张先生/HR",
  "AAB092": "公司简介..."
}
```


---

## 7. Configuration

### Configuration file `config/config.yml`

```yaml
# Application settings
app:
  name: job-crawler
  version: 1.0.0
  debug: false

# Bazhuayu API settings
api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "00f3b445-d8ec-44e8-88b2-4b971a228b1e"
      name: "青岛招聘数据"
      enabled: true
    - id: "task-id-2"
      name: "任务2"
      enabled: false

# RabbitMQ settings
rabbitmq:
  host: rabbitmq           # Service name inside the Docker network
  port: 5672
  username: guest
  password: guest
  queue: job_data
  message_ttl: 604800000   # Message expiry: 7 days (milliseconds)

# Collection settings
crawler:
  filter_days: 7           # Data validity window (days)
  max_expired_batches: 3   # Threshold of consecutive expired batches
  max_workers: 5           # Maximum number of parallel tasks
  auto_start: true         # Start collecting automatically when the container starts

# Database settings
database:
  path: data/crawl_progress.db
```
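The document does not spell out how `max_expired_batches` is applied. One plausible reading, sketched here as an assumption, is that a back-to-front run stops early once that many consecutive batches contain no record inside the `filter_days` window, since older batches are then unlikely to contain fresh data. The function name and its input (the number of records kept per batch) are hypothetical.

```python
from typing import Iterable


def batches_before_stop(kept_per_batch: Iterable[int],
                        max_expired_batches: int = 3) -> int:
    """Return how many batches are processed before the run stops early."""
    consecutive_expired = 0
    processed = 0
    for kept in kept_per_batch:
        processed += 1
        # A batch counts as expired when the date filter kept nothing from it.
        consecutive_expired = consecutive_expired + 1 if kept == 0 else 0
        if consecutive_expired >= max_expired_batches:
            break
    return processed
```

A single batch with fresh data resets the counter, so isolated empty batches do not end the run.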

---

## 8. Deployment

### 8.1 One-step deployment with Docker Compose

```bash
# 1. Configure
cd job_crawler
cp config/config.yml.docker config/config.yml
# Edit config/config.yml and fill in the username and password

# 2. Build the image
./deploy.sh build

# 3. Start the services
./deploy.sh up

# 4. View logs
./deploy.sh logs

# 5. Check status
./deploy.sh status
```

### 8.2 Deployment script commands

| Command | Description |
|---------|-------------|
| `./deploy.sh build` | Build the image |
| `./deploy.sh up` | Start the services |
| `./deploy.sh down` | Stop the services |
| `./deploy.sh restart` | Restart the application |
| `./deploy.sh logs` | View application logs |
| `./deploy.sh status` | Check service status |
| `./deploy.sh reset` | Remove data volumes and restart |

### 8.3 Service ports

| Service | Port | Description |
|---------|------|-------------|
| FastAPI | 8000 | HTTP API |
| RabbitMQ | 5672 | AMQP protocol |
| RabbitMQ | 15672 | Management UI |

### 8.4 Access URLs

- API docs: http://localhost:8000/docs
- RabbitMQ management UI: http://localhost:15672 (guest/guest)

---

## 9. Data Flow

```
Bazhuayu API → Collector (keeps last 7 days) → RabbitMQ (TTL = 7 days) → Third-party consumers
                                                   ↓
                                    Removed automatically on expiry
```

---

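## 10. Automatic Token Refresh

The system manages the API token automatically:

1. The token is fetched automatically on the first request
2. The token is cached in memory
3. Token validity is checked before each request (the token is refreshed 5 minutes before it expires)
4. On a 401 error, the token is fetched again automatically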
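The four rules above map onto a small in-memory cache. This is a sketch, not the project's `api_client.py`: the class name `TokenCache`, the `fetch_token` callback returning `(token, lifetime_in_seconds)`, and the injectable `clock` are all assumptions made so the example stays self-contained and testable.

```python
import time
from typing import Callable, Optional, Tuple

REFRESH_MARGIN_SECONDS = 5 * 60  # refresh 5 minutes before expiry


class TokenCache:
    """Keeps one token in memory and refreshes it shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], Tuple[str, float]],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch_token = fetch_token  # returns (token, lifetime in seconds)
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, fetching or refreshing it when needed."""
        if (self._token is None
                or self._clock() >= self._expires_at - REFRESH_MARGIN_SECONDS):
            self.refresh()
        assert self._token is not None
        return self._token

    def refresh(self) -> None:
        """Fetch a new token unconditionally (also the reaction to a 401)."""
        token, lifetime = self._fetch_token()
        self._token = token
        self._expires_at = self._clock() + lifetime
```

An API client would call `get()` before every request and call `refresh()` once, then retry, when the server answers with 401.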

---

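## 11. Error Handling

| Failure scenario | Handling strategy |
|------------------|-------------------|
| API request fails | Retry 3 times with exponential backoff |
| Token expired | Refresh the token automatically |
| RabbitMQ connection fails | Reconnect automatically |
| Date parsing fails | Log the error and skip the record |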
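The "retry 3 times with exponential backoff" strategy for API requests could be sketched as follows. The function name, the 1-second base delay, and the injectable `sleep` are assumptions; "3 retries" is read here as one initial attempt plus up to three more. Real code would catch only the HTTP client's transient errors rather than every `Exception`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], retries: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying failed calls with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            # Give up once the last allowed retry has failed.
            if attempt == retries:
                raise
            # Delays double each time: 1 s, 2 s, 4 s with the defaults.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Injecting `sleep` lets tests record the delays instead of actually waiting.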

---

## 12. Quick Start

```bash
# 1. Configure
cd job_crawler
cp config/config.yml.docker config/config.yml
# Edit config/config.yml and fill in the username and password

# 2. Build and start
./deploy.sh build
./deploy.sh up

# 3. View the collection logs
./deploy.sh logs

# 4. Consume data
curl "http://localhost:8000/consume?batch_size=10"

# 5. Check the queue size
curl http://localhost:8000/queue/size
```
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								# 招聘数据增量采集与消息队列服务技术方案
 								## 1. 项目概述
 								### 1.1 需求背景
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								从八爪鱼API采集招聘数据，筛选近7天发布的数据，通过RabbitMQ消息队列提供数据消费接口，支持消息级别TTL自动过期。
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
 								### 1.2 核心功能
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								- 增量采集八爪鱼API招聘数据（从后往前采集，最新数据优先）
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								- 日期过滤（发布日期 + 采集时间均在7天内）
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								- RabbitMQ消息队列（支持消息TTL，7天自动过期）
 								- 容器启动自动开始采集
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								- 提供REST API消费接口
 								---
 								## 2. 系统架构
 								```
 								┌─────────────────────────────────────────────────────────────────┐
 								│                         系统架构图                               │
 								├─────────────────────────────────────────────────────────────────┤
 								│                                                                 │
 								│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
 								│  │  八爪鱼API   │───▶│  采集服务    │───▶│   日期过滤器     │  │
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								│  │  (数据源)    │    │ (从后往前)   │    │  (7天内数据)     │  │
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								│  └──────────────┘    └──────────────┘    └────────┬─────────┘  │
 								│                                                    │            │
 								│                                                    ▼            │
 								│  ┌──────────────────────────────────────────────────────────┐  │
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								│  │                    RabbitMQ 服务                          │  │
 								│  │  ┌─────────────────────────────────────────────────────┐ │  │
 								│  │  │  Queue: job_data                                    │ │  │
 								│  │  │  - 消息TTL: 7天 (604800000ms)                       │ │  │
 								│  │  │  - 过期消息自动删除                                  │ │  │
 								│  │  │  - 持久化存储                                        │ │  │
 								│  │  └─────────────────────────────────────────────────────┘ │  │
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								│  └──────────────────────────────────────────────────────────┘  │
 								│                                                    │            │
 								│                                                    ▼            │
 								│  ┌──────────────────────────────────────────────────────────┐  │
 								│  │                    FastAPI 服务                           │  │
 								│  │  ┌─────────────────┐  ┌─────────────────────────────┐    │  │
 								│  │  │ GET /consume    │  │ GET /status                 │    │  │
 								│  │  │ (消费数据)      │  │ (采集状态/进度)             │    │  │
 								│  │  └─────────────────┘  └─────────────────────────────┘    │  │
 								│  └──────────────────────────────────────────────────────────┘  │
 								│                                                                 │
 								└─────────────────────────────────────────────────────────────────┘
 								```
 								---
 								## 3. 技术选型
 								| 组件 | 技术方案 | 版本 | 说明 |
 								|------|---------|------|------|
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								| 运行环境 | Python | 3.11+ | 主开发语言 |
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								| HTTP客户端 | httpx | 0.27+ | 异步HTTP请求 |
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								| 消息队列 | RabbitMQ | 3.12+ | 支持消息级别TTL |
 								| MQ客户端 | pika | 1.3+ | Python RabbitMQ SDK |
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								| API框架 | FastAPI | 0.109+ | REST接口 |
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								| 容器编排 | Docker Compose | 2.0+ | 服务部署 |
 								| 数据存储 | SQLite | 内置 | 存储采集进度 |
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
 								---
 								## 4. 项目结构
 								```
 								job_crawler/
 								├── app/                        # 应用代码
 								│   ├── api/                    # API路由层
 								│   │   ├── __init__.py
 								│   │   └── routes.py           # 路由定义
 								│   ├── core/                   # 核心配置
 								│   │   ├── __init__.py
 								│   │   ├── config.py           # 配置管理
 								│   │   └── logging.py          # 日志配置
 								│   ├── models/                 # 数据模型
 								│   │   ├── __init__.py
 								│   │   ├── job.py              # 招聘数据模型
 								│   │   ├── progress.py         # 进度模型
 								│   │   └── response.py         # 响应模型
 								│   ├── services/               # 业务服务层
 								│   │   ├── __init__.py
 								│   │   ├── api_client.py       # 八爪鱼API客户端
 								│   │   ├── crawler.py          # 采集核心逻辑
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								│   │   ├── rabbitmq_service.py # RabbitMQ服务
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								│   │   └── progress_store.py   # 进度存储
 								│   ├── utils/                  # 工具函数
 								│   │   ├── __init__.py
 								│   │   └── date_parser.py      # 日期解析
 								│   ├── __init__.py
 								│   └── main.py                 # 应用入口
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								├── config/                     # 配置文件
 								│   ├── config.yml              # 运行配置
 								│   └── config.yml.docker       # Docker配置模板
 								├── docker-compose.yml          # 容器编排
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								├── Dockerfile                  # 应用镜像构建
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								├── deploy.sh                   # 部署脚本(Linux)
 								├── deploy.bat                  # 部署脚本(Windows)
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								├── requirements.txt            # Python依赖
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								└── README.md                   # 使用说明
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								```
---

## 5. Core Module Design

### 5.1 Incremental Collection Module

#### Collection Strategy (Newest to Oldest)

```python
# Incremental collection flow
# 1. Fetch the total number of records, `total`.
# 2. Read the start position of the previous run, `last_start_offset`.
# 3. Compute the range for this run:
#    - start_offset = total - batch_size  (start from the newest data)
#    - end_offset = last_start_offset     (stop at the previous run's position)
# 4. Loop: step `offset` from start_offset down to end_offset.
# 5. Filter each batch and send it to RabbitMQ immediately.
# 6. When the run completes, save last_start_offset = this run's start position.
```
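
The steps above can be sketched as a small generator. This is a minimal illustration under stated assumptions, not the project's actual implementation: `batch_ranges` is a hypothetical helper, a first run with no saved `last_start_offset` is assumed to scan back to offset 0, and the API is assumed to accept a smaller final batch.

```python
from typing import Iterator, Optional, Tuple


def batch_ranges(
    total: int,
    batch_size: int,
    last_start_offset: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) pairs from the newest records back to end_offset."""
    # Assumption: with no saved progress, walk all the way back to offset 0.
    end_offset = last_start_offset if last_start_offset is not None else 0
    upper = total  # exclusive upper bound of the records still to fetch
    while upper > end_offset:
        # Step back one batch, but never past end_offset.
        offset = max(upper - batch_size, end_offset)
        yield offset, upper - offset
        upper = offset


# First run over 250 records in batches of 100.
first_run = list(batch_ranges(250, 100))

# Later run: the previous run started at 269900 and the total is now 270250.
later_run = list(batch_ranges(270250, 100, last_start_offset=269900))
```

Because the range ends at `last_start_offset`, as in step 3, the newest batch of the previous run is fetched again under this reading, so downstream consumers may see a small number of repeated records.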

#### Progress Persistence

Collection progress is stored in SQLite:

```sql
CREATE TABLE crawl_progress (
    task_id TEXT PRIMARY KEY,
    last_start_offset INTEGER,  -- start position of the previous run
    total INTEGER,
    last_update TIMESTAMP,
    status TEXT,
    filtered_count INTEGER,
    produced_count INTEGER
);
```
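
As a rough illustration of how this table might be used, the sketch below saves and reloads progress with Python's built-in `sqlite3` module. The helper names `save_progress` and `load_last_start_offset` are hypothetical, and the in-memory database and the `demo-task` ID are only for demonstration.

```python
import sqlite3
from datetime import datetime
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_progress (
    task_id TEXT PRIMARY KEY,
    last_start_offset INTEGER,
    total INTEGER,
    last_update TIMESTAMP,
    status TEXT,
    filtered_count INTEGER,
    produced_count INTEGER
)
"""


def save_progress(conn: sqlite3.Connection, task_id: str, last_start_offset: int,
                  total: int, status: str, filtered_count: int,
                  produced_count: int) -> None:
    """Insert or overwrite the single progress row for a task."""
    conn.execute(
        "INSERT OR REPLACE INTO crawl_progress VALUES (?, ?, ?, ?, ?, ?, ?)",
        (task_id, last_start_offset, total,
         datetime.now().isoformat(timespec="seconds"),
         status, filtered_count, produced_count),
    )
    conn.commit()


def load_last_start_offset(conn: sqlite3.Connection, task_id: str) -> Optional[int]:
    """Return the saved offset, or None if the task has never run."""
    row = conn.execute(
        "SELECT last_start_offset FROM crawl_progress WHERE task_id = ?",
        (task_id,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")  # demonstration only; a real service would use a file
conn.execute(SCHEMA)
save_progress(conn, "demo-task", 269900, 270000, "completed", 15000, 15000)
```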
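
### 5.2 Date Filtering Module

#### aae397 Field Format Parsing

| Raw value | Parsing rule | Example result |
|-----------|--------------|----------------|
| "今天" (today) | Current date | 2026-01-15 |
| "1月13日" | Current year + month and day | 2026-01-13 |
| "1月9日" | Current year + month and day | 2026-01-09 |

#### Filtering Logic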

```python
from datetime import datetime, timedelta


def is_within_days(aae397: str, collect_time: str, days: int = 7) -> bool:
    """
    Return True if the record falls within the given number of days.
    Condition: the publish date AND the collection time are both within N days.
    """
    today = datetime.now().date()
    cutoff_date = today - timedelta(days=days)

    # parse_aae397 and parse_collect_time are helpers not shown here;
    # both must return datetime.date values for the comparison below.
    publish_date = parse_aae397(aae397)
    collect_date = parse_collect_time(collect_time)

    return publish_date >= cutoff_date and collect_date >= cutoff_date
```
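
The two parsing helpers used by `is_within_days` are not shown in this section. The sketch below is one possible shape for them, following the parsing table above. It makes three assumptions that are not in the original: unparseable values are treated as "not within range", a month/day that would fall in the future is attributed to the previous year (relevant around New Year), and `Collect_time` uses the `YYYY-MM-DD` format seen in the sample data. The extra `today` parameter exists only to make the example deterministic.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

_MONTH_DAY = re.compile(r"^(\d{1,2})月(\d{1,2})日$")


def parse_aae397(value: str, today: Optional[date] = None) -> Optional[date]:
    """Parse values such as "今天" or "1月13日" into a date."""
    today = today or date.today()
    value = value.strip()
    if value == "今天":
        return today
    match = _MONTH_DAY.match(value)
    if not match:
        return None
    month, day = int(match.group(1)), int(match.group(2))
    try:
        parsed = date(today.year, month, day)
        # Assumption: a publish date cannot be in the future, so roll back a year.
        if parsed > today:
            parsed = date(today.year - 1, month, day)
    except ValueError:  # e.g. "2月30日"
        return None
    return parsed


def parse_collect_time(value: str) -> Optional[date]:
    """Parse a YYYY-MM-DD collection time (format assumed from the sample data)."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


def is_within_days(aae397: str, collect_time: str, days: int = 7,
                   today: Optional[date] = None) -> bool:
    """Variant of the filter above that tolerates unparseable input."""
    today = today or date.today()
    cutoff_date = today - timedelta(days=days)
    publish_date = parse_aae397(aae397, today)
    collect_date = parse_collect_time(collect_time)
    if publish_date is None or collect_date is None:
        return False
    return publish_date >= cutoff_date and collect_date >= cutoff_date


reference_day = date(2026, 1, 15)
recent = is_within_days("1月13日", "2026-01-15", today=reference_day)
stale = is_within_days("1月7日", "2026-01-15", today=reference_day)
```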

### 5.3 RabbitMQ Service Module

#### Message TTL Mechanism

```python
import pika

# `channel` is an open pika channel; `message` is the serialized record.

# Set the message TTL when declaring the queue
channel.queue_declare(
    queue='job_data',
    durable=True,
    arguments={
        'x-message-ttl': 604800000  # 7 days (milliseconds)
    }
)

# Also set a TTL on each published message (double safeguard)
channel.basic_publish(
    exchange='',
    routing_key='job_data',
    body=message,
    properties=pika.BasicProperties(
        delivery_mode=2,  # persistent
        expiration='604800000'  # 7 days
    )
)
```
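
#### Advantages

- Message-level TTL gives precise control over when each message expires
- Expired messages are deleted automatically, with no manual cleanup
- The queue only ever holds valid data from the last 7 days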
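
The value 604800000 is simply 7 days expressed in milliseconds. The short sketch below, using a hypothetical helper `ttl_millis`, shows the arithmetic and the two forms the value takes: an integer for the `x-message-ttl` queue argument and a string for the per-message `expiration` property.

```python
from datetime import timedelta


def ttl_millis(days: int) -> int:
    """Convert a number of days to milliseconds."""
    return int(timedelta(days=days).total_seconds() * 1000)


# Queue-level TTL is passed as an integer argument.
QUEUE_ARGUMENTS = {"x-message-ttl": ttl_millis(7)}

# Per-message TTL is passed as a string in the message properties.
MESSAGE_EXPIRATION = str(ttl_millis(7))
```

Deriving both values from the same helper keeps the queue-level and message-level settings in step if the retention window changes.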

### 5.4 REST API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/consume` | GET | Consume data from the queue; accepts a `batch_size` parameter |
| `/queue/size` | GET | Get the number of messages in the queue |
| `/status` | GET | View collection progress and status |
| `/tasks` | GET | List tasks |
| `/crawl/start` | POST | Manually trigger a collection task |
| `/crawl/stop` | POST | Stop a collection task |

#### Endpoint Details

**GET /consume**

```json
// Request
GET /consume?batch_size=10

// Response
{
  "code": 0,
  "data": [
    {
      "_id": "uuid",
      "_task_id": "00f3b445-...",
      "_crawl_time": "2026-01-15T10:30:00",
      "Std_class": "机动车司机/驾驶",
      "aca112": "保底1万+五险+港内A2驾驶员",
      "AAB004": "青岛唐盛物流有限公司",
      "acb241": "1-1.5万",
      "aab302": "青岛黄岛区",
      "aae397": "1月13日",
      "Collect_time": "2026-01-15",
      ...
    }
  ],
  "count": 10
}
```
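
A consumer of this endpoint could look roughly like the following standard-library sketch. The helper names and the base URL `http://localhost:8000` are hypothetical, and treating `"code": 0` as the success indicator is an assumption drawn from the sample response rather than a documented contract.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def consume_url(base_url: str, batch_size: int = 10) -> str:
    """Build the /consume URL with its batch_size query parameter."""
    query = urlencode({"batch_size": batch_size})
    return f"{base_url.rstrip('/')}/consume?{query}"


def parse_consume_response(body: str) -> List[Dict[str, Any]]:
    """Return the list of records, raising if the response signals an error."""
    payload = json.loads(body)
    # Assumption: code == 0 means success, as in the sample response.
    if payload.get("code") != 0:
        raise RuntimeError(f"consume failed with code {payload.get('code')}")
    return payload.get("data", [])


def consume(base_url: str, batch_size: int = 10, timeout: float = 10.0):
    """Fetch one batch of records from a running service."""
    with urlopen(consume_url(base_url, batch_size), timeout=timeout) as response:
        return parse_consume_response(response.read().decode("utf-8"))


# Parsing a response shaped like the sample, without needing a live server.
sample_body = '{"code": 0, "data": [{"_id": "uuid", "aae397": "1月13日"}], "count": 1}'
records = parse_consume_response(sample_body)
```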

**GET /status**

```json
{
  "code": 0,
  "data": {
    "tasks": [
      {
        "task_id": "00f3b445-...",
        "task_name": "青岛招聘数据",
        "total": 270000,
        "last_start_offset": 269900,
        "status": "completed",
        "filtered_count": 15000,
        "produced_count": 15000,
        "is_running": false
      }
    ],
    "queue_size": 12345,
    "running_count": 0
  }
}
```

---

## 6. Data Model
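
### 6.1 Preserving Raw Data

Collected records keep their original field names; only metadata fields are added:

| Field | Description |
|-------|-------------|
| _id | Unique identifier (UUID) |
| _task_id | Task ID |
| _crawl_time | Ingestion time |
| Other fields | All fields returned by the source API, unchanged |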
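
A minimal sketch of this enrichment step is shown below. The function name `with_metadata` is hypothetical, and the choice to let the generated metadata take precedence over any identically named keys in the raw record is an assumption.

```python
import uuid
from datetime import datetime
from typing import Any, Dict


def with_metadata(raw: Dict[str, Any], task_id: str) -> Dict[str, Any]:
    """Return a copy of the raw record with the three metadata fields added."""
    meta = {
        "_id": str(uuid.uuid4()),
        "_task_id": task_id,
        "_crawl_time": datetime.now().isoformat(timespec="seconds"),
    }
    # Keep every original field untouched; metadata keys win on a name clash.
    original = {key: value for key, value in raw.items() if key not in meta}
    return {**meta, **original}


raw_record = {"aae397": "1月13日", "Collect_time": "2026-01-15"}
record = with_metadata(raw_record, "demo-task")
```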

### 6.2 RabbitMQ Message Format

```json
{
  "_id": "uuid",
  "_task_id": "00f3b445-d8ec-44e8-88b2-4b971a228b1e",
  "_crawl_time": "2026-01-15T10:30:00",
  "Std_class": "机动车司机/驾驶",
  "aca112": "保底1万+五险+港内A2驾驶员",
  "AAB004": "青岛唐盛物流有限公司",
  "AAB019": "民营",
  "acb241": "1-1.5万",
  "aab302": "青岛黄岛区",
  "AAE006": "青岛市黄岛区...",
  "aae397": "1月13日",
  "Collect_time": "2026-01-15",
  "ACE760": "https://www.zhaopin.com/...",
  "acb22a": "岗位职责...",
  "Experience": "5-10年",
  "aac011": "学历不限",
  "acb240": "1人",
  "AAB022": "交通/运输/物流",
  "Num_employers": "20-99人",
  "AAE004": "张先生/HR",
  "AAB092": "公司简介..."
}
```

---

## 7. Configuration

### Configuration file `config/config.yml`

```yaml
# Application settings
app:
  name: job-crawler
  version: 1.0.0
  debug: false

# 八爪鱼 (Bazhuayu) API settings
api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "00f3b445-d8ec-44e8-88b2-4b971a228b1e"
      name: "青岛招聘数据"
      enabled: true
    - id: "task-id-2"
      name: "任务2"
      enabled: false

# RabbitMQ settings
rabbitmq:
  host: rabbitmq           # Docker-internal service name
  port: 5672
  username: guest
  password: guest
  queue: job_data
  message_ttl: 604800000   # message expiry: 7 days (milliseconds)

# Crawler settings
crawler:
  filter_days: 7           # data validity window (days)
  max_expired_batches: 3   # threshold of consecutive expired batches
  max_workers: 5           # maximum number of parallel tasks
  auto_start: true         # start collecting automatically when the container starts
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
 								# 数据库配置
 								database:
-												rabbitmq

											
										
										
											2026-01-15 21:20:57 +08:00
+								  path: data/crawl_progress.db
-												feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation

											
										
										
											2026-01-15 17:09:43 +08:00
+								```
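As a rough illustration of how these settings might be consumed, the sketch below works on a plain dictionary standing in for the parsed file (for example the result of PyYAML's `yaml.safe_load`, which is an assumption about how the service loads it). The helper names `enabled_tasks` and `ttl_matches_filter_window` are hypothetical and not part of the project.

```python
from datetime import timedelta

# Stand-in for the parsed config/config.yml; only the keys used below are reproduced.
config = {
    "api": {
        "tasks": [
            {"id": "00f3b445-d8ec-44e8-88b2-4b971a228b1e", "enabled": True},
            {"id": "task-id-2", "enabled": False},
        ],
    },
    "rabbitmq": {"queue": "job_data", "message_ttl": 604800000},
    "crawler": {"filter_days": 7},
}


def enabled_tasks(cfg: dict) -> list:
    """Return the task entries whose `enabled` flag is true (hypothetical helper)."""
    return [t for t in cfg["api"].get("tasks", []) if t.get("enabled", False)]


def ttl_matches_filter_window(cfg: dict) -> bool:
    """Check that the queue TTL (milliseconds) equals the crawler's filter window (days)."""
    ttl = timedelta(milliseconds=cfg["rabbitmq"]["message_ttl"])
    return ttl == timedelta(days=cfg["crawler"]["filter_days"])


print([t["id"] for t in enabled_tasks(config)])
print(ttl_matches_filter_window(config))
```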
---
## 8. Deployment
### 8.1 One-command deployment with Docker Compose
```bash
# 1. Configure
cd job_crawler
cp config/config.yml.docker config/config.yml
# Edit config/config.yml and fill in the API username and password

# 2. Build the image
./deploy.sh build

# 3. Start the services
./deploy.sh up

# 4. View logs
./deploy.sh logs

# 5. Check status
./deploy.sh status
```
### 8.2 Deployment script commands
| Command | Description |
|------|------|
| `./deploy.sh build` | Build the image |
| `./deploy.sh up` | Start the services |
| `./deploy.sh down` | Stop the services |
| `./deploy.sh restart` | Restart the application |
| `./deploy.sh logs` | View application logs |
| `./deploy.sh status` | View service status |
| `./deploy.sh reset` | Remove the data volumes and restart |
### 8.3 Service ports
| Service | Port | Description |
|------|------|------|
| FastAPI | 8000 | HTTP API |
| RabbitMQ | 5672 | AMQP protocol |
| RabbitMQ | 15672 | Management UI |
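After `./deploy.sh up`, a quick way to confirm that the three ports in the table accept connections is a plain TCP probe. The following is a minimal sketch using only the standard library; the host name `localhost` and the helper `port_is_open` are assumptions for a local deployment, not part of the project.

```python
import socket

# Ports taken from the table above; the labels are for display only.
SERVICE_PORTS = {
    "FastAPI (HTTP API)": 8000,
    "RabbitMQ (AMQP)": 5672,
    "RabbitMQ (management UI)": 15672,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    for name, port in SERVICE_PORTS.items():
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"{name:<26} {port:>5}  {state}")
```

An open port only shows that something is listening; it does not confirm that the service behind it is healthy.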
### 8.4 Access URLs
- API docs: http://localhost:8000/docs
- RabbitMQ management UI: http://localhost:15672 (guest/guest)
---
## 9. Data flow
```
Bazhuayu API → Crawler service (last 7 days only) → RabbitMQ (TTL = 7 days) → Third-party consumers
                                                        ↓
                                          Expired messages auto-deleted
```
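The filtering step in this flow can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the record fields `publish_date` and `collected_at` are hypothetical names, and the stop rule reflects one plausible reading of `max_expired_batches` (stop after that many consecutive batches with no record inside the window, given that batches are walked newest first).

```python
from datetime import datetime, timedelta

FILTER_DAYS = 7          # mirrors crawler.filter_days
MAX_EXPIRED_BATCHES = 3  # mirrors crawler.max_expired_batches


def is_fresh(publish_date: datetime, collected_at: datetime, now: datetime,
             days: int = FILTER_DAYS) -> bool:
    """Keep a record only if both timestamps fall within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return publish_date >= cutoff and collected_at >= cutoff


def collect(batches, now: datetime, max_expired: int = MAX_EXPIRED_BATCHES) -> list:
    """Walk batches newest-first and stop after `max_expired` consecutive
    batches that contain no fresh record (assumed semantics)."""
    kept = []
    expired_streak = 0
    for batch in batches:
        fresh = [r for r in batch
                 if is_fresh(r["publish_date"], r["collected_at"], now)]
        kept.extend(fresh)
        # A batch with at least one fresh record resets the streak.
        expired_streak = 0 if fresh else expired_streak + 1
        if expired_streak >= max_expired:
            break
    return kept
```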
---
## 10. Automatic token refresh
The system manages tokens automatically:
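1. Fetch a token automatically on the first request.
2. Cache the token in memory.
3. Check the token's validity before each request and refresh it 5 minutes before it expires.
4. On a 401 response, fetch a new token automatically.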
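The four steps above can be sketched as a small in-memory cache. This is a minimal sketch, not the project's implementation: `fetch_token` and `send` are hypothetical callables standing in for the real login and request calls, and the token lifetime is assumed to be returned in seconds.

```python
import time
from typing import Callable, Optional, Tuple

REFRESH_MARGIN_SECONDS = 5 * 60  # refresh 5 minutes before expiry


class TokenManager:
    """In-memory token cache with early refresh.

    `fetch_token` is a hypothetical callable returning (token, lifetime_seconds).
    """

    def __init__(self, fetch_token: Callable[[], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get_token(self) -> str:
        """Return the cached token, fetching a new one if missing or near expiry."""
        near_expiry = self._clock() >= self._expires_at - REFRESH_MARGIN_SECONDS
        if self._token is None or near_expiry:
            self._token, lifetime = self._fetch_token()
            self._expires_at = self._clock() + lifetime
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token, for example after the API answers 401."""
        self._token = None


def call_with_token(manager: TokenManager,
                    send: Callable[[str], Tuple[int, object]]):
    """Send a request; on a 401 status, fetch a new token and retry once.

    `send` is a hypothetical callable taking a token and returning (status, body).
    """
    status, body = send(manager.get_token())
    if status == 401:
        manager.invalidate()
        status, body = send(manager.get_token())
    return status, body
```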
---
## 11. Error handling
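| Exception scenario | Handling strategy |
|---------|---------|
| API request failure | Retry 3 times with exponential backoff |
| Token expired | Refresh the token automatically |
| RabbitMQ connection failure | Reconnect automatically |
| Date parsing failure | Log the error and skip the record |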
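
The retry and date-parsing rows above can be sketched as follows. This is a minimal illustration, not the service's actual code: the names `retry_with_backoff` and `parse_publish_date`, the 1-second base delay, the `%Y-%m-%d` date format, and the reading of "retry 3 times" as one initial call plus up to three retries are all assumptions.

```python
import logging
import time
from datetime import datetime
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("job_crawler.sketch")  # hypothetical logger name


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,  # assumed; the source does not state a delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func; on failure retry up to `retries` times, doubling the wait each time."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower error types
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt + 1, exc, delay)
            sleep(delay)
    raise RuntimeError("unreachable")


def parse_publish_date(raw: str, fmt: str = "%Y-%m-%d") -> Optional[datetime]:
    """Return the parsed date, or None so the caller can skip the record."""
    try:
        return datetime.strptime(raw, fmt)
    except (ValueError, TypeError):
        logger.warning("skipping record with unparseable date: %r", raw)
        return None
```

The `sleep` parameter is injectable so that the backoff schedule can be checked without actually waiting. A caller would drop any record for which `parse_publish_date` returns `None`.
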

---

## 12. Quick Start

```bash
# 1. Configure
cd job_crawler
cp config/config.yml.docker config/config.yml
# Edit config/config.yml and fill in the account credentials
# 2. One-command startup
./deploy.sh build
./deploy.sh up
# 3. View the collection logs
./deploy.sh logs
# 4. Consume data
# The URL is quoted so that shells such as zsh do not treat "?" as a glob
curl "http://localhost:8000/consume?batch_size=10"
# 5. Check the queue size
curl http://localhost:8000/queue/size
```
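
The same two HTTP calls can also be made from Python. The sketch below uses only the standard library and assumes the service is reachable at `http://localhost:8000` as in the curl commands. The helper names `build_url`, `get_json`, and `main` are hypothetical, and because the response schema is not described here, the parsed JSON is returned as-is.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_url(path: str, **params) -> str:
    """Join BASE_URL with an endpoint path and optional query parameters."""
    url = urljoin(BASE_URL + "/", path.lstrip("/"))
    if params:
        url += "?" + urlencode(params)
    return url


def get_json(path: str, timeout: float = 10.0, **params):
    """GET an endpoint and decode the JSON body (schema not assumed)."""
    with urlopen(build_url(path, **params), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def main() -> None:
    # Requires the service to be running (after ./deploy.sh up).
    print(get_json("/queue/size"))
    print(get_json("/consume", batch_size=10))
```

Calling `main()` once the containers are up prints the queue size followed by one batch of up to 10 messages.
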