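# Metrics Recording Options: Comparison and Usage Guide

## Background

In multi-instance deployments (Kubernetes, Serverless), the original in-memory metrics storage has the following problems:

1. **Scattered metrics**: each instance records its own metrics, so they cannot be aggregated
2. **Data loss**: metrics are lost when an instance is destroyed
3. **Inaccurate statistics**: there is no accurate global view of the metrics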

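## Comparison of Options

### Option 1: Pushgateway (recommended)

**How it works:** The application pushes its metrics to Pushgateway, and Prometheus scrapes them from Pushgateway.

**Pros:**
- ✅ Maintained by the Prometheus project, with a mature ecosystem
- ✅ Simple to implement, with small code changes
- ✅ Suits short-lived jobs (Serverless, batch processing)
- ✅ Supports persistence to disk (the `--persistence.file` flag), so pushed metrics survive a Pushgateway restart

**Cons:**
- ⚠️ Single point of failure (can be mitigated with a highly available deployment)
- ⚠️ Not suited to very high push rates (thousands of pushes per second)
- ⚠️ No built-in aggregation: a push overwrites earlier metrics with the same name in the same grouping key, so each instance needs its own grouping key (for example an `instance` label) and totals are computed in PromQL

**Use cases:**
- Serverless functions
- Batch jobs
- Short-lived containers
- Deployments where the number of instances changes dynamically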
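
To make the push flow concrete, here is a minimal sketch that uses only the standard library. It assumes Pushgateway's HTTP API (`PUT /metrics/job/<job>/<label>/<value>` with a text-format body). The helpers `build_push_url`, `render_counter`, and `push` are hypothetical names for illustration, not the functions in `metrics_pushgateway`, and the `method`/`endpoint` label names are also assumptions.

```python
import urllib.request
from urllib.parse import quote


def build_push_url(gateway: str, job: str, grouping: dict[str, str]) -> str:
    """Build the push URL; the path after /metrics is the grouping key.

    Values containing "/" need Pushgateway's base64 form, not handled here.
    """
    parts = [f"http://{gateway}/metrics/job/{quote(job, safe='')}"]
    for label, value in sorted(grouping.items()):
        parts.append(f"{quote(label, safe='')}/{quote(value, safe='')}")
    return "/".join(parts)


def render_counter(name: str, labels: dict[str, str], value: float) -> str:
    """Render one counter sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} counter\n{name}{{{label_str}}} {value}\n"


def push(url: str, body: str, timeout: float = 2.0) -> int:
    """PUT the metrics, replacing what is stored under this grouping key."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


url = build_push_url("localhost:9091", "functional_scaffold", {"instance": "instance-1"})
body = render_counter("http_requests_total", {"method": "POST", "endpoint": "/invoke"}, 3)
print(url)
print(body, end="")
# push(url, body)  # uncomment when a Pushgateway is reachable
```

Putting the instance ID in the grouping key keeps instances from overwriting each other; a global total can then be computed at query time with `sum without (instance) (http_requests_total)`.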

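### Option 2: Redis + custom Exporter

**How it works:** The application writes metrics to Redis, and a custom Exporter reads them from Redis and converts them to the Prometheus format.

**Pros:**
- ✅ Flexible and fully under your control; supports complex aggregation logic
- ✅ Redis is fast and handles highly concurrent writes
- ✅ Allows custom metric calculations

**Cons:**
- ⚠️ You have to implement the Exporter yourself, which raises maintenance cost
- ⚠️ Adds system complexity
- ⚠️ Redis brings extra operational cost

**Use cases:**
- Custom metric aggregation logic is required
- Very high-frequency metric writes (tens of thousands per second)
- Metric data must be queryable in real time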
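
The Exporter's conversion step can be sketched as follows. This is an illustration under stated assumptions: the hash `metrics:request_counter` (the key used in the verification steps of this guide) is assumed to store fields encoded as `method:endpoint:status`, and `render_request_counter` is a hypothetical helper, not the project's Exporter. A plain dict stands in for the result of redis-py's `hgetall`.

```python
def render_request_counter(hash_data: dict[str, str]) -> str:
    """Convert a Redis hash (field -> count) into Prometheus text format."""
    lines = ["# TYPE http_requests_total counter"]
    for field in sorted(hash_data):
        # Assumed field layout: "<method>:<endpoint>:<status>"
        method, endpoint, status = field.split(":", 2)
        labels = f'method="{method}",endpoint="{endpoint}",status="{status}"'
        lines.append(f"http_requests_total{{{labels}}} {int(hash_data[field])}")
    return "\n".join(lines) + "\n"


# Stand-in for:
#   redis.Redis(decode_responses=True).hgetall("metrics:request_counter")
sample = {"POST:/invoke:success": "41", "POST:/invoke:error": "1"}
output = render_request_counter(sample)
print(output, end="")
```

Because every instance increments the same hash, the Exporter sees one global count per label set, which is what the in-memory approach could not provide.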

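### Option 3: Standard Prometheus pull mode (not recommended)

**How it works:** Prometheus scrapes metrics from every instance and aggregates them at query time.

**Pros:**
- ✅ The standard Prometheus approach
- ✅ No extra components needed

**Cons:**
- ❌ Requires service discovery (Kubernetes Service Discovery)
- ❌ Short-lived instances may exit before they are scraped
- ❌ Metrics that have not been scraped yet are lost when an instance is destroyed

**Use cases:**
- Long-lived services
- A relatively fixed number of instances
- Mature service discovery already in place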

## Usage Guide

### Option 1: Pushgateway (recommended)

#### 1. Start the services

```bash
cd deployment
docker-compose up -d redis pushgateway prometheus grafana
```

#### 2. Update the code

In `src/functional_scaffold/api/routes.py`:

```python
# Swap the import
from functional_scaffold.core.metrics_pushgateway import (
    track_request,
    track_algorithm_execution,
)

# Usage is unchanged
@router.post("/invoke")
@track_request("POST", "/invoke")
async def invoke_algorithm(request: InvokeRequest):
    ...  # business logic
```
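
For readers curious what a decorator like `track_request` does, the sketch below shows one possible shape. It is not the implementation in `metrics_pushgateway`: the in-memory stores `REQUEST_COUNTS` and `REQUEST_SECONDS` and the `success`/`error` status values are assumptions made for illustration, and a real version would forward these values to the metrics backend.

```python
import asyncio
import functools
import time
from collections import Counter

REQUEST_COUNTS: Counter = Counter()  # (method, endpoint, status) -> count
REQUEST_SECONDS: dict = {}           # (method, endpoint) -> list of durations


def track_request(method: str, endpoint: str):
    """Hypothetical decorator: count calls by outcome and time them."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "success"
            try:
                return await func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both paths, so failures are counted too.
                REQUEST_COUNTS[(method, endpoint, status)] += 1
                REQUEST_SECONDS.setdefault((method, endpoint), []).append(
                    time.perf_counter() - start
                )
        return wrapper
    return decorator


@track_request("POST", "/invoke")
async def invoke_algorithm(number: int) -> int:
    return number * 2


result = asyncio.run(invoke_algorithm(17))
print(result, dict(REQUEST_COUNTS))
```

Recording in `finally` keeps the handler's return value and exceptions untouched, which is why the route code does not change when the metrics backend is swapped.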

#### 3. Configure environment variables

In the `.env` file:

```bash
PUSHGATEWAY_URL=localhost:9091
METRICS_JOB_NAME=functional_scaffold
# Optional; defaults to HOSTNAME
INSTANCE_ID=instance-1
```
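
A sketch of how these variables might be read at startup, including the `HOSTNAME` fallback for `INSTANCE_ID`. The `PushgatewaySettings` class and `load_settings` function are hypothetical, and the defaults for `PUSHGATEWAY_URL` and `METRICS_JOB_NAME` are assumptions copied from the example values above. The sketch reads the process environment only; loading `.env` into it is left to whatever the project already uses.

```python
import os
import socket
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PushgatewaySettings:
    url: str
    job_name: str
    instance_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> PushgatewaySettings:
    """Read settings, falling back from INSTANCE_ID to HOSTNAME to the OS hostname."""
    instance_id = env.get("INSTANCE_ID") or env.get("HOSTNAME") or socket.gethostname()
    return PushgatewaySettings(
        url=env.get("PUSHGATEWAY_URL", "localhost:9091"),            # assumed default
        job_name=env.get("METRICS_JOB_NAME", "functional_scaffold"),  # assumed default
        instance_id=instance_id,
    )


# Passing a dict instead of os.environ makes the fallback easy to see.
settings = load_settings({"PUSHGATEWAY_URL": "pushgateway:9091", "HOSTNAME": "pod-abc"})
print(settings)
```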

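#### 4. Verify

```bash
# Check the metrics held by Pushgateway
curl http://localhost:9091/metrics

# Open the Prometheus UI (`open` is macOS-specific; on Linux use xdg-open or a browser)
open http://localhost:9090

# Example PromQL query to run in the Prometheus UI:
#   http_requests_total{job="functional_scaffold"}
```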

### Option 2: Redis + Exporter

#### 1. Start the services

```bash
cd deployment
docker-compose up -d redis redis-exporter prometheus grafana
```

#### 2. Update the code

In `src/functional_scaffold/api/routes.py`:

```python
# Swap the import
from functional_scaffold.core.metrics_redis import (
    track_request,
    track_algorithm_execution,
)

# Usage is unchanged
@router.post("/invoke")
@track_request("POST", "/invoke")
async def invoke_algorithm(request: InvokeRequest):
    ...  # business logic
```

#### 3. Configure environment variables

In the `.env` file:

```bash
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_METRICS_DB=0
# Optional
REDIS_PASSWORD=
# Optional
INSTANCE_ID=instance-1
```

#### 4. Install the Redis dependency

```bash
pip install redis
```

Or add it to `requirements.txt`:

```
redis>=5.0.0
```

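#### 5. Verify

```bash
# Inspect the metrics stored in Redis
redis-cli HGETALL metrics:request_counter

# Check the Exporter output
curl http://localhost:8001/metrics

# Open the Prometheus UI (`open` is macOS-specific; on Linux use xdg-open or a browser)
open http://localhost:9090
```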

## Performance Comparison

The figures below are approximate.

| Metric | Pushgateway | Redis + Exporter | Standard Pull |
|--------|-------------|------------------|---------------|
| Write latency | ~5 ms | ~1 ms | N/A |
| Query latency | ~10 ms | ~20 ms | ~5 ms |
| Throughput | ~1000 req/s | ~10000 req/s | ~500 req/s |
| Memory usage | Low | Medium | Low |
| Complexity | Low | High | Low |

## Migration Steps

### Migrating from the original approach to Pushgateway

1. **Install dependencies** (if needed):
   ```bash
   pip install prometheus-client
   ```

2. **Swap the import**:
   ```python
   # Old code
   from functional_scaffold.core.metrics import track_request

   # New code
   from functional_scaffold.core.metrics_pushgateway import track_request
   ```

3. **Configure environment variables**:
   ```bash
   export PUSHGATEWAY_URL=localhost:9091
   ```

4. **Start Pushgateway**:
   ```bash
   docker-compose up -d pushgateway
   ```

5. **Update the Prometheus configuration** (already included in `monitoring/prometheus.yml`)

6. **Test and verify**:
   ```bash
   # Send a request (the JSON body needs an explicit Content-Type header)
   curl -X POST http://localhost:8000/invoke \
     -H "Content-Type: application/json" \
     -d '{"number": 17}'

   # Check the metrics
   curl http://localhost:9091/metrics | grep http_requests_total
   ```

### Migrating from the original approach to Redis

1. **Install dependencies**:
   ```bash
   pip install redis
   ```

2. **Swap the import**:
   ```python
   # Old code
   from functional_scaffold.core.metrics import track_request

   # New code
   from functional_scaffold.core.metrics_redis import track_request
   ```

3. **Configure environment variables**:
   ```bash
   export REDIS_HOST=localhost
   export REDIS_PORT=6379
   ```

4. **Start Redis and the Exporter**:
   ```bash
   docker-compose up -d redis redis-exporter
   ```

5. **Test and verify**:
   ```bash
   # Send a request (the JSON body needs an explicit Content-Type header)
   curl -X POST http://localhost:8000/invoke \
     -H "Content-Type: application/json" \
     -d '{"number": 17}'

   # Inspect Redis
   redis-cli HGETALL metrics:request_counter

   # Check the Exporter
   curl http://localhost:8001/metrics
   ```

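## FAQ

### Q1: Will Pushgateway become a single point of failure?

A: The risk can be reduced in the following ways:
- Deploy multiple Pushgateway instances behind a load balancer (instances do not share state, so Prometheus must scrape each of them)
- Use persistent storage (already configured)
- Fall back to local logging when a push fails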
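
The fallback in the last item can be sketched as below. `push_with_fallback` is a hypothetical helper, not part of the project; the push function is injected so that the same wrapper works with any client, and `unreachable_gateway` only simulates an outage.

```python
import logging
from typing import Callable

logger = logging.getLogger("metrics")


def push_with_fallback(push: Callable[[str], None], payload: str) -> bool:
    """Try to push; on any failure, log the payload locally and carry on."""
    try:
        push(payload)
        return True
    except Exception as exc:  # metrics must never break the request path
        logger.warning("metrics push failed (%s); payload=%r", exc, payload)
        return False


def unreachable_gateway(payload: str) -> None:
    """Simulates a Pushgateway outage."""
    raise ConnectionError("pushgateway unreachable")


delivered: list = []
ok = push_with_fallback(delivered.append, "http_requests_total 1\n")
failed = push_with_fallback(unreachable_gateway, "http_requests_total 2\n")
print(ok, failed, delivered)
```

Logging the payload keeps the data recoverable from log files, at the cost of those samples not appearing in Prometheus until they are replayed.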

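### Q2: How does the Redis option perform?

A: A single Redis instance can handle on the order of 100,000+ QPS, which is enough for most workloads. If you need more, you can:
- Use Redis Cluster
- Batch writes (fewer network round trips)
- Use pipelining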
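
Batching and pipelining can be combined: accumulate increments in memory, then send them in one round trip. The sketch below is an assumption-laden illustration. `BatchedCounter` is a hypothetical class, the flush function is injected, and the commented redis-py snippet shows how a flush might use `pipeline()` and `hincrby` against the `metrics:request_counter` hash. It is not thread-safe as written.

```python
from collections import Counter
from typing import Callable, Mapping


class BatchedCounter:
    """Accumulate increments locally and flush them in one batch."""

    def __init__(self, flush_fn: Callable[[Mapping[str, int]], None], max_pending: int = 100):
        self._flush_fn = flush_fn
        self._max_pending = max_pending
        self._pending: Counter = Counter()
        self._count = 0

    def incr(self, field: str, amount: int = 1) -> None:
        self._pending[field] += amount
        self._count += 1
        if self._count >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending, self._count = dict(self._pending), Counter(), 0
        self._flush_fn(batch)


# With redis-py, flush_fn could look like this (assumption, not project code):
#   def flush_to_redis(batch):
#       pipe = client.pipeline()
#       for field, amount in batch.items():
#           pipe.hincrby("metrics:request_counter", field, amount)
#       pipe.execute()
flushed: list = []
counter = BatchedCounter(flushed.append, max_pending=3)
for _ in range(3):
    counter.incr("POST:/invoke:success")   # third call triggers a flush
counter.incr("POST:/invoke:error")
counter.flush()                            # flush the remainder explicitly
print(flushed)
```

The trade-off is that increments still in the buffer are lost if the process dies before a flush, so short-lived instances should flush on shutdown.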

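### Q3: How do I use this in Kubernetes?

A:
- **Pushgateway**: deploy it as a Service; the application reaches it by Service name
- **Redis**: use a StatefulSet or a managed Redis service

### Q4: Can metric data be lost?

A:
- **Pushgateway**: supports persistence; pushed metrics survive a restart
- **Redis**: AOF persistence is configured; data survives a restart
- **Standard Pull**: metrics that have not been scraped yet are lost when the instance is destroyed

### Q5: How do I choose an option?

Recommendations:
- **Serverless / short-lived** → Pushgateway
- **Very high concurrency / custom logic** → Redis
- **Long-lived / K8s** → Standard Pull (service discovery must be configured)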

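## Monitoring and Alerting

### Grafana dashboards

Open http://localhost:3000 (admin/admin)

Preconfigured panels:
- Total HTTP requests
- HTTP request latency (P50/P95/P99)
- Algorithm execution count
- Algorithm execution latency
- Error rate

### Alert rules

Configure them in `monitoring/alerts/rules.yaml`:

```yaml
groups:
  - name: functional_scaffold
    rules:
      - alert: HighErrorRate
        # Ratio of failed requests to all requests, so the 0.05 threshold means 5%
        expr: sum(rate(http_requests_total{status="error"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate"
          description: "Error rate is above 5%"
```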

## References

- [Prometheus Pushgateway documentation](https://github.com/prometheus/pushgateway)
- [Prometheus best practices](https://prometheus.io/docs/practices/)
- [Redis documentation](https://redis.io/documentation)