feat(job_crawler): initialize job crawler service with kafka integration

- Add technical documentation (技术方案.md) with system architecture and design details
- Create FastAPI application structure with modular organization (api, core, models, services, utils)
- Implement job data crawler service with incremental collection from third-party API
- Add Kafka service integration with Docker Compose configuration for message queue
- Create data models for job listings, progress tracking, and API responses
- Implement REST API endpoints for data consumption (/consume, /status) and task management
- Add progress persistence layer using SQLite for tracking collection offsets
- Implement date filtering logic to extract data published within 7 days
- Create API client service for third-party data source integration
- Add configuration management with environment-based settings
- Include Docker support with Dockerfile and docker-compose.yml for containerized deployment
- Add logging configuration and utility functions for date parsing
- Include requirements.txt with all Python dependencies and README documentation
commit ae681575b9
2026-01-15 17:09:43 +08:00
26 changed files with 1898 additions and 0 deletions

job_crawler/.dockerignore

@@ -0,0 +1,17 @@
__pycache__
*.pyc
*.pyo
*.pyd
.Python
.env
.venv
env/
venv/
.git
.gitignore
*.md
*.db
data/
.idea/
.vscode/
*.log

job_crawler/Dockerfile

@@ -0,0 +1,31 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*
# Copy the dependency manifest
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY app/ ./app/
# Create data and config directories
RUN mkdir -p /app/data /app/config
# Set environment variables
ENV PYTHONPATH=/app
ENV PYTHONUNBUFFERED=1
ENV CONFIG_PATH=/app/config/config.yml
# Expose the API port
EXPOSE 8000
# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

job_crawler/README.md

@@ -0,0 +1,135 @@
# Job Data Incremental Collection Service
Collects job-posting data from the Bazhuayu (Octoparse) API, filters postings published within the last 7 days, and exposes them through a built-in Kafka message queue for external consumers.
## Project Structure
```
job_crawler/
├── app/                     # Application code
│   ├── api/                 # API routes
│   ├── core/                # Core configuration
│   ├── models/              # Data models
│   ├── services/            # Business services
│   ├── utils/               # Utility functions
│   └── main.py
├── config/                  # Config directory (mounted)
│   ├── config.yml           # Config file
│   └── config.yml.docker    # Docker config template
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
```
## Quick Start
### 1. Configuration
```bash
cd job_crawler
# Copy the config template
cp config/config.yml.docker config/config.yml
# Edit the config file and fill in your API credentials
vim config/config.yml
```
### 2. Start the Services
```bash
# Start all services
docker-compose up -d
# Tail the application logs
docker-compose logs -f app
```
### 3. Build the Image Separately
```bash
# Build the image
docker build -t job-crawler:latest .
# Run the container (with the config directory mounted)
docker run -d \
  --name job-crawler \
  -p 8000:8000 \
  -v $(pwd)/config:/app/config:ro \
  -v job_data:/app/data \
  job-crawler:latest
```
## Configuration Reference
`config/config.yml`:
```yaml
app:
  name: job-crawler
  debug: false
api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "task-id-1"
      name: "Qingdao job data"
      enabled: true
    - id: "task-id-2"
      name: "Shanghai job data"
      enabled: true
    - id: "task-id-3"
      name: "Beijing job data"
      enabled: false  # disabled
kafka:
  bootstrap_servers: kafka:29092
  topic: job_data
crawler:
  filter_days: 7
  max_workers: 5  # maximum number of parallel tasks
database:
  path: /app/data/crawl_progress.db
```
## API Endpoints
| Endpoint | Method | Description |
|------|------|------|
| `/tasks` | GET | List all tasks |
| `/status` | GET | Collection status (optional `task_id` parameter) |
| `/crawl/start` | POST | Start collection (optional `task_id` parameter) |
| `/crawl/stop` | POST | Stop collection (optional `task_id` parameter) |
| `/consume` | GET | Consume data |
| `/health` | GET | Health check |
### Usage Examples
```bash
# List all tasks
curl http://localhost:8000/tasks
# Status of all tasks
curl http://localhost:8000/status
# Status of a single task
curl "http://localhost:8000/status?task_id=xxx"
# Start all tasks
curl -X POST http://localhost:8000/crawl/start
# Start a single task
curl -X POST "http://localhost:8000/crawl/start?task_id=xxx"
# Stop all tasks
curl -X POST http://localhost:8000/crawl/stop
# Consume data
curl "http://localhost:8000/consume?batch_size=10"
```
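### Python Polling Example
A minimal sketch of an external consumer that polls `/consume` over HTTP. It assumes the service is reachable at `localhost:8000`; the `requests` dependency and the 2-second poll interval are illustrative choices, not part of this service:
```python
"""Sketch: poll the /consume endpoint for new job records."""
import time

import requests  # assumed to be installed in the consumer's environment

BASE_URL = "http://localhost:8000"  # adjust to your deployment


def poll_jobs(batch_size: int = 10, interval_s: float = 2.0) -> None:
    """Fetch batches of job records until an empty batch is returned."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/consume",
            params={"batch_size": batch_size},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()  # {"code": 0, "data": [...], "count": N}
        if body.get("count", 0) == 0:
            break  # nothing left in the queue for this consumer group
        for job in body["data"]:
            print(job.get("job_title"), "-", job.get("company"))
        time.sleep(interval_s)


if __name__ == "__main__":
    poll_jobs()
```
Because `/consume` reads through a shared Kafka consumer group with auto-commit enabled, records returned once are not returned again on later polls.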


@@ -0,0 +1,2 @@
"""招聘数据采集服务"""
__version__ = "1.0.0"


@@ -0,0 +1,4 @@
"""API路由模块"""
from .routes import router
__all__ = ["router"]


@@ -0,0 +1,112 @@
"""API路由"""
import json
import logging
from typing import Optional
from fastapi import APIRouter, Query, BackgroundTasks, HTTPException
from fastapi.responses import StreamingResponse
from app.models import ApiResponse, ConsumeResponse, StatusResponse
from app.services import crawler_manager, kafka_service
logger = logging.getLogger(__name__)
router = APIRouter()
@router.get("/", response_model=ApiResponse)
async def root():
"""服务状态"""
return ApiResponse(message="招聘数据采集服务运行中", data={"version": "1.0.0"})
@router.get("/health")
async def health_check():
"""健康检查"""
return {"status": "healthy"}
@router.get("/status", response_model=StatusResponse)
async def get_status(task_id: Optional[str] = Query(None, description="任务ID不传则返回所有任务状态")):
"""获取采集状态"""
status = crawler_manager.get_status(task_id)
return StatusResponse(data=status)
@router.get("/tasks", response_model=ApiResponse)
async def list_tasks():
"""获取所有任务列表"""
tasks = [
{"task_id": tid, "task_name": c.task_name, "is_running": c.is_running}
for tid, c in crawler_manager.get_all_crawlers().items()
]
return ApiResponse(data={"tasks": tasks, "count": len(tasks)})
@router.post("/crawl/start", response_model=ApiResponse)
async def start_crawl(
background_tasks: BackgroundTasks,
task_id: Optional[str] = Query(None, description="任务ID不传则启动所有任务"),
reset: bool = Query(False, description="是否重置进度从头开始")
):
"""启动采集任务"""
if task_id:
# 启动单个任务
crawler = crawler_manager.get_crawler(task_id)
if not crawler:
raise HTTPException(status_code=404, detail=f"任务不存在: {task_id}")
if crawler.is_running:
raise HTTPException(status_code=400, detail=f"任务已在运行中: {task_id}")
background_tasks.add_task(crawler_manager.start_task, task_id, reset)
return ApiResponse(message=f"任务 {task_id} 已启动", data={"task_id": task_id, "reset": reset})
else:
# 启动所有任务
background_tasks.add_task(crawler_manager.start_all, reset)
return ApiResponse(message="所有任务已启动", data={"reset": reset})
@router.post("/crawl/stop", response_model=ApiResponse)
async def stop_crawl(
task_id: Optional[str] = Query(None, description="任务ID不传则停止所有任务")
):
"""停止采集任务"""
if task_id:
crawler = crawler_manager.get_crawler(task_id)
if not crawler:
raise HTTPException(status_code=404, detail=f"任务不存在: {task_id}")
if not crawler.is_running:
raise HTTPException(status_code=400, detail=f"任务未在运行: {task_id}")
crawler_manager.stop_task(task_id)
return ApiResponse(message=f"任务 {task_id} 正在停止")
else:
if not crawler_manager.is_any_running:
raise HTTPException(status_code=400, detail="没有正在运行的任务")
crawler_manager.stop_all()
return ApiResponse(message="所有任务正在停止")
@router.get("/consume", response_model=ConsumeResponse)
async def consume_data(
batch_size: int = Query(10, ge=1, le=100, description="批量大小"),
timeout: int = Query(5000, ge=1000, le=30000, description="超时时间(毫秒)")
):
"""消费Kafka数据"""
try:
messages = kafka_service.consume(batch_size, timeout)
return ConsumeResponse(data=messages, count=len(messages))
except Exception as e:
logger.error(f"消费数据失败: {e}")
raise HTTPException(status_code=500, detail=str(e))
@router.get("/consume/stream")
async def consume_stream():
"""SSE流式消费"""
async def event_generator():
consumer = kafka_service.get_consumer()
try:
for message in consumer:
data = json.dumps(message.value, ensure_ascii=False)
yield f"data: {data}\n\n"
except Exception as e:
logger.error(f"流式消费错误: {e}")
finally:
consumer.close()
return StreamingResponse(event_generator(), media_type="text/event-stream")


@@ -0,0 +1,5 @@
"""核心模块"""
from .config import settings
from .logging import setup_logging
__all__ = ["settings", "setup_logging"]


@@ -0,0 +1,89 @@
"""配置管理"""
import os
import yaml
from typing import Optional, List
from pydantic import BaseModel
from functools import lru_cache
class AppConfig(BaseModel):
name: str = "job-crawler"
version: str = "1.0.0"
debug: bool = False
class TaskConfig(BaseModel):
"""单个任务配置"""
id: str
name: str = ""
enabled: bool = True
class ApiConfig(BaseModel):
base_url: str = "https://openapi.bazhuayu.com"
username: str = ""
password: str = ""
batch_size: int = 100
tasks: List[TaskConfig] = []
class KafkaConfig(BaseModel):
bootstrap_servers: str = "localhost:9092"
topic: str = "job_data"
consumer_group: str = "job_consumer_group"
class CrawlerConfig(BaseModel):
interval: int = 300
filter_days: int = 7
max_workers: int = 5
class DatabaseConfig(BaseModel):
path: str = "data/crawl_progress.db"
class Settings(BaseModel):
"""应用配置"""
app: AppConfig = AppConfig()
api: ApiConfig = ApiConfig()
kafka: KafkaConfig = KafkaConfig()
crawler: CrawlerConfig = CrawlerConfig()
database: DatabaseConfig = DatabaseConfig()
@classmethod
def from_yaml(cls, config_path: str) -> "Settings":
"""从YAML文件加载配置"""
if not os.path.exists(config_path):
return cls()
with open(config_path, 'r', encoding='utf-8') as f:
data = yaml.safe_load(f) or {}
# 解析tasks
api_data = data.get('api', {})
tasks_data = api_data.pop('tasks', [])
tasks = [TaskConfig(**t) for t in tasks_data]
api_config = ApiConfig(**api_data, tasks=tasks)
return cls(
app=AppConfig(**data.get('app', {})),
api=api_config,
kafka=KafkaConfig(**data.get('kafka', {})),
crawler=CrawlerConfig(**data.get('crawler', {})),
database=DatabaseConfig(**data.get('database', {}))
)
def get_enabled_tasks(self) -> List[TaskConfig]:
"""获取启用的任务列表"""
return [t for t in self.api.tasks if t.enabled]
@lru_cache()
def get_settings() -> Settings:
"""获取配置"""
config_path = os.environ.get("CONFIG_PATH", "config/config.yml")
return Settings.from_yaml(config_path)
settings = get_settings()


@@ -0,0 +1,22 @@
"""日志配置"""
import logging
import sys
from .config import settings
def setup_logging():
"""配置日志"""
level = logging.DEBUG if settings.app.debug else logging.INFO
logging.basicConfig(
level=level,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.StreamHandler(sys.stdout)
]
)
# 降低第三方库日志级别
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("kafka").setLevel(logging.WARNING)
logging.getLogger("uvicorn").setLevel(logging.INFO)

job_crawler/app/main.py

@@ -0,0 +1,35 @@
"""FastAPI应用入口"""
import logging
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.config import settings
from app.core.logging import setup_logging
from app.api import router
from app.services import kafka_service
setup_logging()
logger = logging.getLogger(__name__)
@asynccontextmanager
async def lifespan(app: FastAPI):
"""应用生命周期管理"""
logger.info("服务启动中...")
yield
logger.info("服务关闭中...")
kafka_service.close()
app = FastAPI(
title="招聘数据采集服务",
description="从八爪鱼API采集招聘数据通过Kafka提供消费接口",
version=settings.app.version,
lifespan=lifespan
)
app.include_router(router)
if __name__ == "__main__":
import uvicorn
uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)


@@ -0,0 +1,13 @@
"""数据模型"""
from .job import JobData
from .progress import CrawlProgress, CrawlStatus
from .response import ApiResponse, ConsumeResponse, StatusResponse
__all__ = [
"JobData",
"CrawlProgress",
"CrawlStatus",
"ApiResponse",
"ConsumeResponse",
"StatusResponse"
]


@@ -0,0 +1,60 @@
"""招聘数据模型"""
from pydantic import BaseModel
from datetime import datetime
import uuid
class JobData(BaseModel):
"""招聘数据模型"""
id: str = ""
task_id: str = "" # 任务ID
job_category: str = "" # Std_class - 职位分类
job_title: str = "" # aca112 - 职位名称
company: str = "" # AAB004 - 公司名称
company_type: str = "" # AAB019 - 企业类型
salary: str = "" # acb241 - 薪资范围
location: str = "" # aab302 - 工作地点
address: str = "" # AAE006 - 详细地址
publish_date: str = "" # aae397 - 发布日期
collect_time: str = "" # Collect_time - 采集时间
url: str = "" # ACE760 - 职位链接
description: str = "" # acb22a - 职位描述
experience: str = "" # Experience - 经验要求
education: str = "" # aac011 - 学历要求
headcount: str = "" # acb240 - 招聘人数
industry: str = "" # AAB022 - 行业
company_size: str = "" # Num_employers - 公司规模
contact: str = "" # AAE004 - 联系人
company_intro: str = "" # AAB092 - 公司简介
crawl_time: str = "" # 入库时间
def __init__(self, **data):
super().__init__(**data)
if not self.id:
self.id = str(uuid.uuid4())
if not self.crawl_time:
self.crawl_time = datetime.now().isoformat()
@classmethod
def from_raw(cls, raw: dict) -> "JobData":
"""从原始API数据转换"""
return cls(
job_category=raw.get("Std_class", ""),
job_title=raw.get("aca112", ""),
company=raw.get("AAB004", ""),
company_type=raw.get("AAB019", "").strip(),
salary=raw.get("acb241", ""),
location=raw.get("aab302", ""),
address=raw.get("AAE006", ""),
publish_date=raw.get("aae397", ""),
collect_time=raw.get("Collect_time", ""),
url=raw.get("ACE760", ""),
description=raw.get("acb22a", ""),
experience=raw.get("Experience", ""),
education=raw.get("aac011", ""),
headcount=raw.get("acb240", ""),
industry=raw.get("AAB022", ""),
company_size=raw.get("Num_employers", ""),
contact=raw.get("AAE004", ""),
company_intro=raw.get("AAB092", ""),
)


@@ -0,0 +1,24 @@
"""采集进度模型"""
from pydantic import BaseModel
class CrawlProgress(BaseModel):
"""采集进度"""
task_id: str
current_offset: int = 0
total: int = 0
last_update: str = ""
status: str = "idle" # idle, running, completed, error
class CrawlStatus(BaseModel):
"""采集状态响应"""
task_id: str
total: int
current_offset: int
progress: str
kafka_lag: int = 0
status: str
last_update: str
filtered_count: int = 0
produced_count: int = 0


@@ -0,0 +1,23 @@
"""API响应模型"""
from pydantic import BaseModel
from typing import Optional, Any
class ApiResponse(BaseModel):
"""通用API响应"""
code: int = 0
message: str = "success"
data: Optional[Any] = None
class ConsumeResponse(BaseModel):
"""消费响应"""
code: int = 0
data: list = []
count: int = 0
class StatusResponse(BaseModel):
"""状态响应"""
code: int = 0
data: dict = {}


@@ -0,0 +1,12 @@
"""服务模块"""
from .api_client import api_client, BazhuayuClient
from .kafka_service import kafka_service, KafkaService
from .progress_store import progress_store, ProgressStore
from .crawler import crawler_manager, CrawlerManager, TaskCrawler
__all__ = [
"api_client", "BazhuayuClient",
"kafka_service", "KafkaService",
"progress_store", "ProgressStore",
"crawler_manager", "CrawlerManager", "TaskCrawler"
]


@@ -0,0 +1,91 @@
"""八爪鱼API客户端"""
import httpx
import time
import logging
from typing import Optional, Dict, Any
from app.core.config import settings
logger = logging.getLogger(__name__)
class BazhuayuClient:
"""八爪鱼API客户端"""
def __init__(self):
self.base_url = settings.api.base_url
self.username = settings.api.username
self.password = settings.api.password
self._access_token: Optional[str] = None
self._token_expires_at: float = 0
async def _get_token(self) -> str:
"""获取访问令牌"""
if self._access_token and time.time() < self._token_expires_at - 300:
return self._access_token
logger.info("正在获取新的access_token...")
async with httpx.AsyncClient() as client:
response = await client.post(
f"{self.base_url}/token",
json={
"username": self.username,
"password": self.password,
"grant_type": "password"
},
headers={
"Content-Type": "application/json",
"Accept": "*/*"
},
timeout=30
)
if response.status_code != 200:
raise Exception(f"获取token失败: {response.status_code} - {response.text}")
data = response.json()
token_data = data.get("data", {})
self._access_token = token_data.get("access_token")
expires_in = int(token_data.get("expires_in", 86400))
self._token_expires_at = time.time() + expires_in
logger.info(f"获取token成功有效期: {expires_in}")
return self._access_token
async def fetch_data(self, task_id: str, offset: int, size: int = 100) -> Dict[str, Any]:
"""获取任务数据"""
token = await self._get_token()
async with httpx.AsyncClient() as client:
response = await client.get(
f"{self.base_url}/data/all",
params={
"taskId": task_id,
"offset": offset,
"size": size
},
headers={
"Authorization": f"Bearer {token}",
"Accept": "*/*"
},
timeout=60
)
if response.status_code == 401:
self._access_token = None
self._token_expires_at = 0
return await self.fetch_data(task_id, offset, size)
if response.status_code != 200:
raise Exception(f"获取数据失败: {response.status_code} - {response.text}")
return response.json()
async def get_total_count(self, task_id: str) -> int:
"""获取数据总数"""
result = await self.fetch_data(task_id, 0, 1)
return result.get("data", {}).get("total", 0)
api_client = BazhuayuClient()


@@ -0,0 +1,209 @@
"""多任务增量采集核心逻辑"""
import asyncio
import logging
from typing import Dict, Optional
from concurrent.futures import ThreadPoolExecutor
from app.services.api_client import api_client
from app.services.kafka_service import kafka_service
from app.services.progress_store import progress_store
from app.utils import is_within_days
from app.models import JobData
from app.core.config import settings, TaskConfig
logger = logging.getLogger(__name__)
class TaskCrawler:
"""单个任务采集器"""
def __init__(self, task_config: TaskConfig):
self.task_id = task_config.id
self.task_name = task_config.name or task_config.id
self.batch_size = settings.api.batch_size
self.filter_days = settings.crawler.filter_days
self._running = False
self._total_filtered = 0
self._total_produced = 0
@property
def is_running(self) -> bool:
return self._running
async def start(self, reset: bool = False):
"""开始采集"""
if self._running:
logger.warning(f"[{self.task_name}] 任务已在运行中")
return
self._running = True
self._total_filtered = 0
self._total_produced = 0
logger.info(f"[{self.task_name}] 开始采集任务")
try:
if reset:
progress_store.reset_progress(self.task_id)
current_offset = 0
else:
progress = progress_store.get_progress(self.task_id)
current_offset = progress.current_offset if progress else 0
total = await api_client.get_total_count(self.task_id)
logger.info(f"[{self.task_name}] 数据总数: {total}, 当前偏移: {current_offset}")
if current_offset >= total:
logger.info(f"[{self.task_name}] 数据已全部采集完成")
progress_store.save_progress(self.task_id, current_offset, total, "completed",
self._total_filtered, self._total_produced)
self._running = False
return
while current_offset < total and self._running:
try:
await self._crawl_batch(current_offset)
current_offset += self.batch_size
progress_store.save_progress(self.task_id, current_offset, total, "running",
self._total_filtered, self._total_produced)
progress_pct = min(100, current_offset / total * 100)
logger.info(f"[{self.task_name}] 进度: {progress_pct:.2f}% ({current_offset}/{total})")
await asyncio.sleep(0.5)
except Exception as e:
logger.error(f"[{self.task_name}] 采集批次失败: {e}")
await asyncio.sleep(5)
status = "completed" if current_offset >= total else "stopped"
progress_store.save_progress(self.task_id, current_offset, total, status,
self._total_filtered, self._total_produced)
logger.info(f"[{self.task_name}] 采集任务 {status}")
except Exception as e:
logger.error(f"[{self.task_name}] 采集任务异常: {e}")
progress_store.save_progress(self.task_id, 0, 0, "error",
self._total_filtered, self._total_produced)
finally:
self._running = False
async def _crawl_batch(self, offset: int):
"""采集一批数据"""
result = await api_client.fetch_data(self.task_id, offset, self.batch_size)
data_list = result.get("data", {}).get("data", [])
if not data_list:
return
filtered_jobs = []
for raw in data_list:
aae397 = raw.get("aae397", "")
collect_time = raw.get("Collect_time", "")
if is_within_days(aae397, collect_time, self.filter_days):
job = JobData.from_raw(raw)
job.task_id = self.task_id # 添加任务ID标识
filtered_jobs.append(job)
self._total_filtered += len(filtered_jobs)
if filtered_jobs:
produced = kafka_service.produce_batch(filtered_jobs)
self._total_produced += produced
def stop(self):
"""停止采集"""
logger.info(f"[{self.task_name}] 正在停止采集任务...")
self._running = False
def get_status(self) -> dict:
"""获取采集状态"""
stats = progress_store.get_stats(self.task_id)
if not stats:
return {
"task_id": self.task_id, "task_name": self.task_name,
"total": 0, "current_offset": 0, "progress": "0%",
"status": "idle", "last_update": "",
"filtered_count": 0, "produced_count": 0
}
total = stats.get("total", 0)
current = stats.get("current_offset", 0)
progress = f"{min(100, current / total * 100):.2f}%" if total > 0 else "0%"
return {
"task_id": self.task_id, "task_name": self.task_name,
"total": total, "current_offset": current, "progress": progress,
"status": stats.get("status", "idle"), "last_update": stats.get("last_update", ""),
"filtered_count": stats.get("filtered_count", 0),
"produced_count": stats.get("produced_count", 0)
}
class CrawlerManager:
"""多任务采集管理器"""
def __init__(self):
self._crawlers: Dict[str, TaskCrawler] = {}
self._executor = ThreadPoolExecutor(max_workers=settings.crawler.max_workers)
self._init_crawlers()
def _init_crawlers(self):
"""初始化所有启用的任务采集器"""
for task in settings.get_enabled_tasks():
self._crawlers[task.id] = TaskCrawler(task)
logger.info(f"初始化任务采集器: {task.name} ({task.id})")
def get_crawler(self, task_id: str) -> Optional[TaskCrawler]:
"""获取指定任务的采集器"""
return self._crawlers.get(task_id)
def get_all_crawlers(self) -> Dict[str, TaskCrawler]:
"""获取所有采集器"""
return self._crawlers
async def start_task(self, task_id: str, reset: bool = False) -> bool:
"""启动单个任务"""
crawler = self._crawlers.get(task_id)
if not crawler:
logger.error(f"任务不存在: {task_id}")
return False
if crawler.is_running:
logger.warning(f"任务已在运行: {task_id}")
return False
asyncio.create_task(crawler.start(reset))
return True
async def start_all(self, reset: bool = False):
"""启动所有任务"""
tasks = []
for task_id, crawler in self._crawlers.items():
if not crawler.is_running:
tasks.append(crawler.start(reset))
if tasks:
await asyncio.gather(*tasks)
def stop_task(self, task_id: str) -> bool:
"""停止单个任务"""
crawler = self._crawlers.get(task_id)
if not crawler:
return False
crawler.stop()
return True
def stop_all(self):
"""停止所有任务"""
for crawler in self._crawlers.values():
crawler.stop()
def get_status(self, task_id: str = None) -> dict:
"""获取状态"""
if task_id:
crawler = self._crawlers.get(task_id)
return crawler.get_status() if crawler else {}
# 返回所有任务状态
return {
"tasks": [c.get_status() for c in self._crawlers.values()],
"kafka_lag": kafka_service.get_lag(),
"running_count": sum(1 for c in self._crawlers.values() if c.is_running)
}
@property
def is_any_running(self) -> bool:
"""是否有任务在运行"""
return any(c.is_running for c in self._crawlers.values())
# 全局管理器实例
crawler_manager = CrawlerManager()


@@ -0,0 +1,138 @@
"""Kafka服务"""
import json
import logging
from typing import List, Optional
from kafka import KafkaProducer, KafkaConsumer
from kafka.errors import KafkaError
from kafka.admin import KafkaAdminClient, NewTopic
from app.models import JobData
from app.core.config import settings
logger = logging.getLogger(__name__)
class KafkaService:
"""Kafka生产者/消费者服务"""
def __init__(self):
self.bootstrap_servers = settings.kafka.bootstrap_servers
self.topic = settings.kafka.topic
self.consumer_group = settings.kafka.consumer_group
self._producer: Optional[KafkaProducer] = None
self._ensure_topic()
def _ensure_topic(self):
"""确保Topic存在"""
try:
admin = KafkaAdminClient(
bootstrap_servers=self.bootstrap_servers,
client_id="job_crawler_admin"
)
existing_topics = admin.list_topics()
if self.topic not in existing_topics:
topic = NewTopic(name=self.topic, num_partitions=3, replication_factor=1)
admin.create_topics([topic])
logger.info(f"创建Topic: {self.topic}")
admin.close()
except Exception as e:
logger.warning(f"检查/创建Topic失败: {e}")
@property
def producer(self) -> KafkaProducer:
"""获取生产者实例"""
if self._producer is None:
self._producer = KafkaProducer(
bootstrap_servers=self.bootstrap_servers,
value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode('utf-8'),
key_serializer=lambda k: k.encode('utf-8') if k else None,
acks='all',
retries=3
)
return self._producer
def get_consumer(self, auto_offset_reset: str = 'earliest') -> KafkaConsumer:
"""获取消费者实例"""
return KafkaConsumer(
self.topic,
bootstrap_servers=self.bootstrap_servers,
group_id=self.consumer_group,
auto_offset_reset=auto_offset_reset,
enable_auto_commit=True,
value_deserializer=lambda m: json.loads(m.decode('utf-8')),
consumer_timeout_ms=5000
)
def produce(self, job_data: JobData) -> bool:
"""发送消息到Kafka"""
try:
future = self.producer.send(self.topic, key=job_data.id, value=job_data.model_dump())
future.get(timeout=10)
return True
except KafkaError as e:
logger.error(f"发送消息失败: {e}")
return False
def produce_batch(self, job_list: List[JobData]) -> int:
"""批量发送消息"""
success_count = 0
for job in job_list:
if self.produce(job):
success_count += 1
self.producer.flush()
return success_count
def consume(self, batch_size: int = 10, timeout_ms: int = 5000) -> List[dict]:
"""消费消息"""
messages = []
consumer = KafkaConsumer(
self.topic,
bootstrap_servers=self.bootstrap_servers,
group_id=self.consumer_group,
auto_offset_reset='earliest',
enable_auto_commit=True,
value_deserializer=lambda m: json.loads(m.decode('utf-8')),
consumer_timeout_ms=timeout_ms,
max_poll_records=batch_size
)
try:
for message in consumer:
messages.append(message.value)
if len(messages) >= batch_size:
break
except Exception as e:
logger.debug(f"消费超时或完成: {e}")
finally:
consumer.close()
return messages
def get_lag(self) -> int:
"""获取消息堆积量"""
try:
consumer = KafkaConsumer(bootstrap_servers=self.bootstrap_servers, group_id=self.consumer_group)
partitions = consumer.partitions_for_topic(self.topic)
if not partitions:
consumer.close()
return 0
from kafka import TopicPartition
tps = [TopicPartition(self.topic, p) for p in partitions]
end_offsets = consumer.end_offsets(tps)
total_lag = 0
for tp in tps:
committed = consumer.committed(tp)
end = end_offsets.get(tp, 0)
total_lag += max(0, end - (committed or 0))
consumer.close()
return total_lag
except Exception as e:
logger.warning(f"获取lag失败: {e}")
return 0
def close(self):
"""关闭连接"""
if self._producer:
self._producer.close()
self._producer = None
kafka_service = KafkaService()


@@ -0,0 +1,95 @@
"""采集进度存储"""
import sqlite3
import os
import logging
from datetime import datetime
from typing import Optional
from contextlib import contextmanager
from app.models import CrawlProgress
from app.core.config import settings
logger = logging.getLogger(__name__)
class ProgressStore:
"""采集进度存储SQLite"""
def __init__(self, db_path: str = None):
self.db_path = db_path or settings.database.path
os.makedirs(os.path.dirname(self.db_path) or ".", exist_ok=True)
self._init_db()
def _init_db(self):
"""初始化数据库"""
with self._get_conn() as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS crawl_progress (
task_id TEXT PRIMARY KEY,
current_offset INTEGER DEFAULT 0,
total INTEGER DEFAULT 0,
last_update TEXT,
status TEXT DEFAULT 'idle',
filtered_count INTEGER DEFAULT 0,
produced_count INTEGER DEFAULT 0
)
""")
conn.commit()
@contextmanager
def _get_conn(self):
"""获取数据库连接"""
conn = sqlite3.connect(self.db_path)
conn.row_factory = sqlite3.Row
try:
yield conn
finally:
conn.close()
def get_progress(self, task_id: str) -> Optional[CrawlProgress]:
"""获取采集进度"""
with self._get_conn() as conn:
cursor = conn.execute("SELECT * FROM crawl_progress WHERE task_id = ?", (task_id,))
row = cursor.fetchone()
if row:
return CrawlProgress(
task_id=row["task_id"],
current_offset=row["current_offset"],
total=row["total"],
last_update=row["last_update"] or "",
status=row["status"]
)
return None
def save_progress(self, task_id: str, offset: int, total: int,
status: str = "running", filtered_count: int = 0, produced_count: int = 0):
"""保存采集进度"""
now = datetime.now().isoformat()
with self._get_conn() as conn:
conn.execute("""
INSERT INTO crawl_progress
(task_id, current_offset, total, last_update, status, filtered_count, produced_count)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(task_id) DO UPDATE SET
current_offset = excluded.current_offset, total = excluded.total,
last_update = excluded.last_update, status = excluded.status,
filtered_count = excluded.filtered_count, produced_count = excluded.produced_count
""", (task_id, offset, total, now, status, filtered_count, produced_count))
conn.commit()
def get_stats(self, task_id: str) -> dict:
"""获取统计信息"""
with self._get_conn() as conn:
cursor = conn.execute("SELECT * FROM crawl_progress WHERE task_id = ?", (task_id,))
row = cursor.fetchone()
if row:
return dict(row)
return {}
def reset_progress(self, task_id: str):
"""重置采集进度"""
with self._get_conn() as conn:
conn.execute("DELETE FROM crawl_progress WHERE task_id = ?", (task_id,))
conn.commit()
progress_store = ProgressStore()


@@ -0,0 +1,4 @@
"""工具模块"""
from .date_parser import parse_aae397, parse_collect_time, is_within_days
__all__ = ["parse_aae397", "parse_collect_time", "is_within_days"]


@@ -0,0 +1,74 @@
"""日期解析工具"""
import re
from datetime import datetime, timedelta
from typing import Optional
def parse_aae397(date_str: str) -> Optional[datetime]:
"""
解析发布日期字段 aae397
支持格式:
- "今天"
- "1月13日"
- "12月31日"
"""
if not date_str:
return None
date_str = date_str.strip()
today = datetime.now()
# 处理 "今天"
if date_str == "今天":
return today
# 处理 "X月X日" 格式
pattern = r"(\d{1,2})月(\d{1,2})日"
match = re.match(pattern, date_str)
if match:
month = int(match.group(1))
day = int(match.group(2))
year = today.year
try:
parsed_date = datetime(year, month, day)
if parsed_date > today:
parsed_date = datetime(year - 1, month, day)
return parsed_date
except ValueError:
return None
return None
def parse_collect_time(date_str: str) -> Optional[datetime]:
"""
解析采集时间字段 Collect_time
格式: "2026-01-15"
"""
if not date_str:
return None
try:
return datetime.strptime(date_str.strip(), "%Y-%m-%d")
except ValueError:
return None
def is_within_days(date_str: str, collect_time_str: str, days: int = 7) -> bool:
"""
判断数据是否在指定天数内
条件: 发布日期 AND 采集时间 都在指定天数内
"""
today = datetime.now()
cutoff_date = today - timedelta(days=days)
publish_date = parse_aae397(date_str)
if publish_date is None:
return False
collect_date = parse_collect_time(collect_time_str)
if collect_date is None:
return False
return publish_date >= cutoff_date and collect_date >= cutoff_date
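To make the 7-day filter concrete, a small illustrative check of these helpers (assumes it is run inside the project so `app.utils` is importable; the sample dates are hypothetical and the results depend on the current date at run time):
```python
# Illustrative only: outputs depend on today's date when executed.
from app.utils import parse_aae397, is_within_days

print(parse_aae397("今天"))     # -> today's date
print(parse_aae397("1月13日"))  # -> Jan 13 of the current year (or last year if that date lies in the future)

# A record is kept only when BOTH the publish date and the collection time
# fall inside the configured window (7 days by default).
print(is_within_days("今天", "2026-01-15", days=7))
```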


@@ -0,0 +1,41 @@
# Job data collection service configuration
# Application settings
app:
  name: job-crawler
  version: 1.0.0
  debug: false
# Bazhuayu API settings
api:
  base_url: https://openapi.bazhuayu.com
  username: "13051331101"
  password: "abc19910515"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "00f3b445-d8ec-44e8-88b2-4b971a228b1e"
      name: "Qingdao job data"
      enabled: true
    - id: "task-id-2"
      name: "Task 2"
      enabled: false
    - id: "task-id-3"
      name: "Task 3"
      enabled: false
# Kafka settings
kafka:
  bootstrap_servers: localhost:9092
  topic: job_data
  consumer_group: job_consumer_group
# Crawler settings
crawler:
  interval: 300      # collection interval (seconds)
  filter_days: 7     # filter window in days
  max_workers: 5     # maximum number of parallel tasks
# Database settings
database:
  path: data/crawl_progress.db


@@ -0,0 +1,39 @@
# Docker environment configuration
# Copy this file to config.yml and fill in your credentials
# Application settings
app:
  name: job-crawler
  version: 1.0.0
  debug: false
# Bazhuayu API settings
api:
  base_url: https://openapi.bazhuayu.com
  username: "your_username"
  password: "your_password"
  batch_size: 100
  # Multi-task configuration
  tasks:
    - id: "00f3b445-d8ec-44e8-88b2-4b971a228b1e"
      name: "Qingdao job data"
      enabled: true
    - id: "task-id-2"
      name: "Task 2"
      enabled: false
# Kafka settings (Docker internal network)
kafka:
  bootstrap_servers: kafka:29092
  topic: job_data
  consumer_group: job_consumer_group
# Crawler settings
crawler:
  interval: 300
  filter_days: 7
  max_workers: 5
# Database settings
database:
  path: /app/data/crawl_progress.db


@@ -0,0 +1,75 @@
version: '3.8'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    container_name: job-zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zookeeper_data:/var/lib/zookeeper/data
    healthcheck:
      test: ["CMD", "nc", "-z", "localhost", "2181"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - job-network

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    container_name: job-kafka
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
    volumes:
      - kafka_data:/var/lib/kafka/data
    depends_on:
      zookeeper:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "kafka-topics", "--bootstrap-server", "localhost:9092", "--list"]
      interval: 10s
      timeout: 10s
      retries: 5
    networks:
      - job-network

  app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: job-crawler
    ports:
      - "8000:8000"
    environment:
      - CONFIG_PATH=/app/config/config.yml
    volumes:
      - ./config:/app/config:ro
      - app_data:/app/data
    depends_on:
      kafka:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - job-network

networks:
  job-network:
    driver: bridge

volumes:
  zookeeper_data:
  kafka_data:
  app_data:
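Since the broker advertises a `PLAINTEXT_HOST` listener on `localhost:9092`, external clients can also read the `job_data` topic directly instead of going through the HTTP API. A minimal kafka-python sketch (the `external_reader` group id is an arbitrary example, not used by the service itself):
```python
"""Sketch: consume the job_data topic directly from the host machine."""
import json

from kafka import KafkaConsumer  # kafka-python, already listed in requirements.txt

consumer = KafkaConsumer(
    "job_data",
    bootstrap_servers="localhost:9092",  # host listener exposed by docker-compose
    group_id="external_reader",          # example group id for an external reader
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    consumer_timeout_ms=10000,           # stop iterating after 10s of inactivity
)

for message in consumer:
    job = message.value
    print(job.get("job_title"), "-", job.get("company"))

consumer.close()
```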


@@ -0,0 +1,8 @@
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.27.0
kafka-python==2.0.2
apscheduler==3.10.4
pydantic==2.5.3
python-dotenv==1.0.0
PyYAML==6.0.1