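Roog (顾新培) 5921f71756 main: Add core files and initialize the project

Added:
- Create the basic project structure.
- Add `.gitignore` and `.dockerignore` files.
- Write `pyproject.toml` and the dependency files.
- Add the algorithm module and example algorithms.
- Implement the core modules (logging, error handling, metrics).
- Add the scripts and documentation needed for development and running.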

2026-02-03 18:38:08 +08:00

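Metrics Recording: Option Comparison and Usage Guide

Background

In multi-instance deployments (Kubernetes, Serverless), the original in-memory metrics storage has the following problems:

- Scattered metrics: each instance records its own metrics, so they cannot be aggregated.
- Data loss: metrics are lost when an instance is destroyed.
- Inaccurate statistics: there is no accurate global view of the metrics.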

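Option Comparison

Option 1: Pushgateway (recommended)

How it works: the application pushes its metrics to Pushgateway, and Prometheus scrapes them from Pushgateway.

Pros:

✅ Officially supported by Prometheus, with a mature ecosystem
✅ Simple to implement, with few code changes
✅ Suited to short-lived jobs (Serverless, batch processing)
✅ Supports persistence, so data survives a restart

Cons:

⚠️ Single point of failure (can be mitigated with a high-availability deployment)
⚠️ Not suited to very high push rates (thousands of pushes per second)

Suitable for:

- Serverless functions
- Batch jobs
- Short-lived containers
- Deployments where the number of instances changes dynamically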
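
The push flow described for Option 1 can be sketched with only the standard library. This is a minimal illustration of the HTTP call, not the project's `metrics_pushgateway` module: the function names `build_push_url`, `render_counter`, and `push_counter` are hypothetical, and the job and instance values are assumed to contain no `/`. In practice the `prometheus_client` package offers `push_to_gateway`, which covers the same ground.

```python
import urllib.request
from urllib.parse import quote


def build_push_url(gateway: str, job: str, instance: str) -> str:
    """Build the Pushgateway URL for one job/instance grouping key.

    Assumes job and instance contain no "/" characters.
    """
    return (
        f"http://{gateway}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )


def render_counter(name: str, help_text: str, value: float, labels: dict) -> bytes:
    """Render one counter sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    sample = f"{name}{{{label_str}}}" if label_str else name
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} counter",
        f"{sample} {value}",
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_counter(gateway: str, job: str, instance: str, body: bytes,
                 timeout: float = 2.0) -> int:
    """PUT the payload, replacing all metrics stored under this grouping key."""
    req = urllib.request.Request(
        build_push_url(gateway, job, instance), data=body, method="PUT"
    )
    req.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Including the instance in the grouping key keeps each instance's series separate on the gateway, so one instance's push does not overwrite another's.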

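Option 2: Redis + custom Exporter

How it works: the application writes its metrics to Redis, and a custom Exporter reads them from Redis and converts them to the Prometheus format.

Pros:

✅ Flexible and controllable, with support for complex aggregation logic
✅ Redis is fast and handles highly concurrent writes
✅ Custom metric calculations are possible

Cons:

⚠️ You have to implement the Exporter yourself, which is costly to maintain
⚠️ Adds system complexity
⚠️ Redis brings extra operational cost

Suitable for:

- Custom metric aggregation logic
- Very high write rates (tens of thousands of writes per second)
- Real-time queries against the metric data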
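
The Exporter's conversion step can be sketched as a pure function. The layout assumed here is hypothetical: each field of the `metrics:request_counter` hash is taken to be `method:endpoint:status` with the count as its value, which the document does not confirm. The function name `render_request_counter` is also illustrative.

```python
def render_request_counter(hash_data: dict) -> str:
    """Convert a Redis hash of request counts to Prometheus text format.

    Assumes each field looks like "POST:/invoke:success" (hypothetical layout)
    and that endpoints contain no ":" characters.
    """
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for field, raw in sorted(hash_data.items()):
        # Redis clients return bytes unless response decoding is enabled.
        if isinstance(field, bytes):
            field = field.decode("utf-8")
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        method, endpoint, status = field.split(":", 2)
        lines.append(
            f'http_requests_total{{method="{method}",endpoint="{endpoint}",'
            f'status="{status}"}} {int(raw)}'
        )
    return "\n".join(lines) + "\n"


# With redis-py, the input would come from client.hgetall("metrics:request_counter").
sample = {"POST:/invoke:success": "3", "POST:/invoke:error": "1"}
print(render_request_counter(sample), end="")
```

Because every instance increments the same hash, the Exporter sees one already-aggregated counter no matter how many instances wrote to it.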

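Option 3: Standard Prometheus Pull mode (not recommended)

How it works: Prometheus scrapes metrics from every instance and aggregates them at query time.

Pros:

✅ The standard Prometheus approach
✅ No extra components needed

Cons:

❌ Requires a service discovery mechanism (Kubernetes Service Discovery)
❌ Short-lived instances may exit before they are scraped
❌ Data is lost when an instance is destroyed

Suitable for:

- Long-lived services
- A relatively fixed number of instances
- Environments with mature service discovery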
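
Query-time aggregation in Pull mode amounts to summing the per-instance series after dropping the `instance` label, roughly what the PromQL expression `sum without (instance) (...)` does. The sketch below mimics that in plain Python for illustration; the function name `sum_without_instance` and the sample data are made up.

```python
from collections import defaultdict


def sum_without_instance(samples):
    """Sum sample values across instances, grouping by the remaining labels.

    samples: iterable of (labels_dict, value) pairs, one per scraped series.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != "instance"))
        totals[key] += value
    return dict(totals)


scraped = [
    ({"instance": "a", "status": "success"}, 3),
    ({"instance": "b", "status": "success"}, 4),
    ({"instance": "a", "status": "error"}, 1),
]
print(sum_without_instance(scraped))
```

An instance that disappears before it is scraped never contributes a series, which is why this mode undercounts short-lived workloads.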

Usage Guide

Option 1: Pushgateway (recommended)

1. Start the services

cd deployment
docker-compose up -d redis pushgateway prometheus grafana

2. Modify the code

In src/functional_scaffold/api/routes.py:

# Replace the import
from functional_scaffold.core.metrics_pushgateway import (
    track_request,
    track_algorithm_execution,
)

# Usage is unchanged
@router.post("/invoke")
@track_request("POST", "/invoke")
async def invoke_algorithm(request: InvokeRequest):
    ...  # business logic
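
The document does not show how `track_request` is implemented. As a rough sketch of what such a decorator might look like, the version below counts calls by method, endpoint, and outcome in an in-memory `Counter`. The `REQUEST_COUNTS` store and the `success`/`error` status values are assumptions for illustration; a real backend would push or write the increment instead.

```python
import asyncio
import functools
from collections import Counter

# Hypothetical stand-in for a real metrics backend (Pushgateway or Redis).
REQUEST_COUNTS = Counter()


def track_request(method: str, endpoint: str):
    """Decorator factory that counts calls to an async handler by outcome."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            status = "success"
            try:
                return await func(*args, **kwargs)
            except Exception:
                status = "error"
                raise  # never swallow the handler's own exception
            finally:
                REQUEST_COUNTS[(method, endpoint, status)] += 1
        return wrapper
    return decorator


@track_request("POST", "/invoke")
async def invoke_algorithm(number: int) -> dict:
    return {"number": number}


asyncio.run(invoke_algorithm(17))
print(REQUEST_COUNTS)
```

Keeping the decorator signature identical across backends is what lets the import be swapped without touching the route code.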

3. Configure environment variables

In the .env file:

PUSHGATEWAY_URL=localhost:9091
METRICS_JOB_NAME=functional_scaffold
INSTANCE_ID=instance-1  # optional, defaults to HOSTNAME
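
A minimal sketch of how these variables could be read on the application side follows. The `PushgatewaySettings` class and `load_settings` function are hypothetical, and the fallback values for `PUSHGATEWAY_URL` and `METRICS_JOB_NAME` are assumptions taken from the example above; only the `HOSTNAME` fallback for `INSTANCE_ID` is stated in the document.

```python
import os
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class PushgatewaySettings:
    url: str
    job_name: str
    instance_id: str


def load_settings(env=None) -> PushgatewaySettings:
    """Read Pushgateway settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # INSTANCE_ID is optional: fall back to HOSTNAME, then the machine hostname.
    instance = env.get("INSTANCE_ID") or env.get("HOSTNAME") or socket.gethostname()
    return PushgatewaySettings(
        url=env.get("PUSHGATEWAY_URL", "localhost:9091"),          # assumed default
        job_name=env.get("METRICS_JOB_NAME", "functional_scaffold"),  # assumed default
        instance_id=instance,
    )


print(load_settings({"HOSTNAME": "pod-7"}))
```

Accepting the mapping as a parameter keeps the function easy to test without mutating the process environment.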

4. Verify

# View the metrics held by Pushgateway
curl http://localhost:9091/metrics

# Open the Prometheus UI (open is the macOS command; use xdg-open on Linux)
open http://localhost:9090

# Example query
http_requests_total{job="functional_scaffold"}

Option 2: Redis + Exporter

1. Start the services

cd deployment
docker-compose up -d redis redis-exporter prometheus grafana

2. Modify the code

In src/functional_scaffold/api/routes.py:

# Replace the import
from functional_scaffold.core.metrics_redis import (
    track_request,
    track_algorithm_execution,
)

# Usage is unchanged
@router.post("/invoke")
@track_request("POST", "/invoke")
async def invoke_algorithm(request: InvokeRequest):
    ...  # business logic

3. Configure environment variables

In the .env file:

REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_METRICS_DB=0
REDIS_PASSWORD=  # optional
INSTANCE_ID=instance-1  # optional

4. Install the Redis dependency

pip install redis

Or add it to requirements.txt:

redis>=5.0.0

5. Verify

# Inspect the metrics stored in Redis
redis-cli
> HGETALL metrics:request_counter

# Check the Exporter output
curl http://localhost:8001/metrics

# Open the Prometheus UI (open is the macOS command; use xdg-open on Linux)
open http://localhost:9090

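Performance Comparison

| Metric | Pushgateway | Redis + Exporter | Standard Pull |
|---|---|---|---|
| Write latency | ~5 ms | ~1 ms | N/A |
| Query latency | ~10 ms | ~20 ms | ~5 ms |
| Throughput | ~1000 req/s | ~10000 req/s | ~500 req/s |
| Memory usage | Low | Medium | Low |
| Complexity | Low | High | Low |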

Migration Steps

Migrating from the original approach to Pushgateway

Install the dependency (if needed):
```
pip install prometheus-client
```

Replace the import:

# Old code
from functional_scaffold.core.metrics import track_request

# New code
from functional_scaffold.core.metrics_pushgateway import track_request

Configure the environment variable:
```
export PUSHGATEWAY_URL=localhost:9091
```

Start Pushgateway:
```
docker-compose up -d pushgateway
```

Update the Prometheus configuration (already included in monitoring/prometheus.yml)

Test and verify:

# Send a request
curl -X POST http://localhost:8000/invoke -H "Content-Type: application/json" -d '{"number": 17}'

# Check the metrics
curl http://localhost:9091/metrics | grep http_requests_total

Migrating from the original approach to Redis

Install the dependency:
```
pip install redis
```

Replace the import:

# Old code
from functional_scaffold.core.metrics import track_request

# New code
from functional_scaffold.core.metrics_redis import track_request

Configure the environment variables:

export REDIS_HOST=localhost
export REDIS_PORT=6379

Start Redis and the Exporter:

docker-compose up -d redis redis-exporter

Test and verify:

# Send a request
curl -X POST http://localhost:8000/invoke -H "Content-Type: application/json" -d '{"number": 17}'

# Check Redis
redis-cli HGETALL metrics:request_counter

# Check the Exporter
curl http://localhost:8001/metrics

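FAQ

Q1: Will Pushgateway become a single point of failure?

A: The risk can be reduced in the following ways:

- Deploy multiple Pushgateway instances behind a load balancer
- Use persistent storage (already configured)
- Fall back to local logging when a push fails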
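
The last point, falling back to local logging when a push fails, can be sketched as a small wrapper. The name `push_with_fallback` and the `push` callable are hypothetical; the callable stands in for whatever function actually sends the payload to Pushgateway.

```python
import logging

logger = logging.getLogger("metrics")


def push_with_fallback(push, payload: dict) -> bool:
    """Try to push metrics; on any failure, log the payload locally instead.

    Returns True if the push succeeded, False if it fell back to the log.
    """
    try:
        push(payload)
        return True
    except Exception as exc:  # metrics must never break the request path
        logger.warning("metrics push failed (%s); payload=%s", exc, payload)
        return False


def unreachable_gateway(payload: dict) -> None:
    # Simulates a Pushgateway that cannot be reached.
    raise ConnectionError("pushgateway unreachable")


print(push_with_fallback(unreachable_gateway, {"http_requests_total": 1}))
```

Logging the payload keeps the data recoverable from the instance's logs even though it never reached Prometheus.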

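Q2: How does the Redis option perform?

A: A single Redis instance can handle 100,000+ QPS, which is enough for most scenarios. If you need more, you can:

- Use Redis Cluster
- Batch writes (fewer network round trips)
- Use a Pipeline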
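
Batching can be sketched as a small buffer that merges increments locally and hands them over in one call. The `CounterBuffer` class and its `flush` callable are hypothetical and not thread-safe. With redis-py, the callable could open `client.pipeline()`, issue one `hincrby` per field, and call `execute()`, so the whole batch costs a single round trip.

```python
from collections import Counter


class CounterBuffer:
    """Accumulate counter increments and flush them in batches."""

    def __init__(self, flush, max_pending: int = 100):
        self._flush = flush              # callable taking a {field: amount} dict
        self._max_pending = max_pending  # number of increments before auto-flush
        self._pending = Counter()
        self._count = 0

    def incr(self, field: str, amount: int = 1) -> None:
        self._pending[field] += amount
        self._count += 1
        if self._count >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending, self._count = dict(self._pending), Counter(), 0
        self._flush(batch)


batches = []
buf = CounterBuffer(batches.append, max_pending=3)
for field in ("POST:/invoke:success", "POST:/invoke:success", "POST:/invoke:error"):
    buf.incr(field)
print(batches)
```

The trade-off is that increments still sitting in the buffer are lost if the instance dies before the next flush, so short-lived instances should flush on shutdown.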

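Q3: How do I use this in Kubernetes?

- Pushgateway: deploy it as a Service; applications reach it by the Service name
- Redis: use a StatefulSet or a managed Redis service

Q4: Can metric data be lost?

- Pushgateway: supports persistence, so data survives a restart
- Redis: AOF persistence is configured, so data survives a restart
- Standard Pull: data is lost when an instance is destroyed

Q5: How do I choose an option?

Recommendations:

- Serverless / short-lived workloads → Pushgateway
- Very high concurrency / custom logic → Redis
- Long-lived services / K8s → Standard Pull (service discovery must be configured)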

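Monitoring and Alerting

Grafana dashboards

Open http://localhost:3000 (login: admin/admin)

Preconfigured panels:

- Total HTTP requests
- HTTP request latency (P50/P95/P99)
- Algorithm execution count
- Algorithm execution latency
- Error rate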

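Alert rules

Configured in monitoring/alerts/rules.yaml:

groups:
  - name: functional_scaffold
    rules:
      - alert: HighErrorRate
        # Ratio of error requests to all requests over the last 5 minutes
        expr: sum(rate(http_requests_total{status="error"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate"
          description: "Error rate is above 5%"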
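
The rule's description refers to an error rate above 5%, which is a ratio of error requests to all requests rather than an absolute number of errors per second. The arithmetic behind that threshold can be sketched as follows; the function names and the sample numbers are illustrative only.

```python
def error_ratio(error_increase: float, total_increase: float) -> float:
    """Share of requests that failed within one evaluation window."""
    if total_increase <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return error_increase / total_increase


def should_alert(error_increase: float, total_increase: float,
                 threshold: float = 0.05) -> bool:
    """True when the error ratio in the window exceeds the threshold."""
    return error_ratio(error_increase, total_increase) > threshold


# 6 errors out of 100 requests in the window is a 6% error ratio.
print(error_ratio(6, 100), should_alert(6, 100))
# 6 errors out of 1000 requests is only 0.6%, well under the threshold.
print(error_ratio(6, 1000), should_alert(6, 1000))
```

The same six errors are alarming at low traffic and negligible at high traffic, which is why the threshold is applied to the ratio.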
