main: Remove redundant documentation and clean up the project directory

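Changes:
- Remove redundant documents, including the Grafana guide, metrics comparison, fix summary, and OpenAPI specification.
- Streamline the project documentation structure and tidy up the README.
- Make the documentation hierarchy clearer and consolidate the core guides.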
2026-02-02 14:59:34 +08:00
parent 241cffebc2
commit b1077e78e9
23 changed files with 3763 additions and 960 deletions
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -0,0 +1,337 @@
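+# Monitoring Guide
+
+This document describes the monitoring stack for FunctionalScaffold: metrics collection, visualization, and alerting configuration.
+
+## Monitoring Architecture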
+
+```
+┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
+│  App instance 1   │     │  App instance 2   │     │  App instance N   │
+│ /metrics endpoint │     │ /metrics endpoint │     │ /metrics endpoint │
+└─────────┬─────────┘     └─────────┬─────────┘     └─────────┬─────────┘
+          │                         │                         │
+          │ write metrics to Redis  │                         │
+          └─────────────────────────┼─────────────────────────┘
+                                    │
+                                    ▼
+                       ┌─────────────────────────┐
+                       │          Redis          │
+                       │  (metrics aggregation)  │
+                       └────────────┬────────────┘
+                                    │
+                                    │ read and export
+                                    ▼
+                       ┌─────────────────────────┐
+                       │       Prometheus        │
+                       │   (scrapes /metrics)    │
+                       └────────────┬────────────┘
+                                    │
+                                    │ query
+                                    ▼
+                       ┌─────────────────────────┐
+                       │         Grafana         │
+                       │     (visualization)     │
+                       └─────────────────────────┘
+```
+
+## Quick Start
+
+### Start the monitoring services
+
+```bash
+cd deployment
+docker-compose up -d redis prometheus grafana
+```
+
+### Access URLs
+
+| Service | URL | Default credentials |
+|---------|-----|---------------------|
+| Application metrics | http://localhost:8000/metrics | - |
+| Prometheus | http://localhost:9090 | - |
+| Grafana | http://localhost:3000 | admin/admin |
+
+## Metrics Reference
+
+### HTTP request metrics
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `http_requests_total` | Counter | method, endpoint, status | Total number of HTTP requests |
+| `http_request_duration_seconds` | Histogram | method, endpoint | HTTP request latency distribution |
+| `http_requests_in_progress` | Gauge | - | Number of requests currently in progress |
+
+### Algorithm execution metrics
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `algorithm_executions_total` | Counter | algorithm, status | Total number of algorithm executions |
+| `algorithm_execution_duration_seconds` | Histogram | algorithm | Algorithm execution latency distribution |
+
+### Async job metrics
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `jobs_created_total` | Counter | algorithm | Total number of jobs created |
+| `jobs_completed_total` | Counter | algorithm, status | Total number of jobs completed |
+| `job_execution_duration_seconds` | Histogram | algorithm | Job execution time distribution |
+| `webhook_deliveries_total` | Counter | status | Total number of webhook deliveries |
+
+## Prometheus Query Examples
+
+### Basic queries
+
+```promql
+# Requests per second (QPS), per series
+rate(http_requests_total[5m])
+
+# QPS grouped by endpoint
+sum(rate(http_requests_total[5m])) by (endpoint)
+
+# Request success ratio
+sum(rate(http_requests_total{status="success"}[5m]))
+/ sum(rate(http_requests_total[5m]))
+
+# Current number of in-flight requests
+http_requests_in_progress
+```
+
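+As a rough illustration of what the QPS and success-ratio queries compute, the sketch below derives a per-second rate from two counter samples. The sample values and the helper name `per_second_rate` are assumptions made for this example. Prometheus's own `rate()` also handles counter resets and extrapolates to the window boundaries, which this sketch leaves out.
+
+```python
+def per_second_rate(first, last):
+    """Average per-second increase between two (timestamp, value) counter samples."""
+    (t0, v0), (t1, v1) = first, last
+    if t1 <= t0:
+        raise ValueError("samples must be in increasing time order")
+    return (v1 - v0) / (t1 - t0)
+
+
+# Hypothetical counter readings taken 300 s (5 minutes) apart.
+success = per_second_rate((0, 1000), (300, 1570))
+error = per_second_rate((0, 20), (300, 50))
+
+qps = success + error            # total requests per second
+success_ratio = success / qps    # share of requests with status="success"
+print(f"QPS={qps:.2f}, success ratio={success_ratio:.3f}")
+```
+
+Because both sides of the ratio are rates over the same window, the window length cancels out and the result is simply the share of successful requests in that window.
+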
+### Latency analysis
+
+```promql
+# P50 latency
+histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
+
+# P95 latency
+histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
+
+# P99 latency
+histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
+
+# Average latency
+rate(http_request_duration_seconds_sum[5m])
+/ rate(http_request_duration_seconds_count[5m])
+```
+
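+The percentile queries above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's implementation; the bucket boundaries and counts are made up for the example.
+
+```python
+import math
+
+
+def histogram_quantile(q, buckets):
+    """Estimate quantile q from cumulative (upper_bound, count) buckets.
+
+    Buckets must be sorted by upper bound and end with (math.inf, total).
+    """
+    total = buckets[-1][1]
+    if total == 0:
+        return math.nan
+    rank = q * total
+    prev_bound, prev_count = 0.0, 0.0
+    for bound, count in buckets:
+        if count >= rank:
+            if math.isinf(bound) or count == prev_count:
+                # No finite upper bound (or empty bucket): fall back to the lower bound.
+                return prev_bound
+            # Linear interpolation inside the bucket that contains the rank.
+            fraction = (rank - prev_count) / (count - prev_count)
+            return prev_bound + (bound - prev_bound) * fraction
+        prev_bound, prev_count = bound, count
+    return prev_bound
+
+
+# Hypothetical cumulative counts for buckets le=0.1, 0.5, 1.0 and +Inf.
+buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
+p50 = histogram_quantile(0.50, buckets)
+p95 = histogram_quantile(0.95, buckets)
+p99 = histogram_quantile(0.99, buckets)
+print(p50, p95, p99)
+```
+
+In this data the P99 rank falls into the `+Inf` bucket, so the estimate is capped at the highest finite bound. This is one reason bucket boundaries should extend past the latencies you expect to alert on.
+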
+### Algorithm analysis
+
+```promql
+# Algorithm execution rate
+sum(rate(algorithm_executions_total[5m])) by (algorithm)
+
+# Algorithm failure ratio
+sum(rate(algorithm_executions_total{status="error"}[5m]))
+/ sum(rate(algorithm_executions_total[5m]))
+
+# Algorithm P95 latency
+histogram_quantile(0.95,
+  sum(rate(algorithm_execution_duration_seconds_bucket[5m])) by (le, algorithm)
+)
+```
+
+### Async job analysis
+
+```promql
+# Job creation rate
+sum(rate(jobs_created_total[5m])) by (algorithm)
+
+# Job success ratio
+sum(rate(jobs_completed_total{status="completed"}[5m]))
+/ sum(rate(jobs_completed_total[5m]))
+
+# Job backlog growth (creation rate minus completion rate); a sustained positive value means the backlog is growing
+sum(rate(jobs_created_total[5m])) - sum(rate(jobs_completed_total[5m]))
+
+# Webhook success ratio
+sum(rate(webhook_deliveries_total{status="success"}[5m]))
+/ sum(rate(webhook_deliveries_total[5m]))
+```
+
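+## Grafana Dashboard
+
+### Importing the dashboard
+
+1. Open Grafana: http://localhost:3000
+2. Log in (admin/admin)
+3. Go to **Dashboards** → **Import**
+4. Upload the file `monitoring/grafana/dashboard.json`
+5. Select the Prometheus data source
+6. Click **Import**
+
+### Dashboard panels
+
+#### HTTP monitoring
+- **HTTP request rate (QPS)** - requests per second over time
+- **HTTP request latency** - P50/P95/P99 latency over time
+- **Request success rate** - success-rate gauge
+- **Current in-flight requests** - real-time concurrency
+- **Total HTTP requests** - cumulative request count
+- **Request distribution** - pie chart by endpoint/status
+
+#### Algorithm monitoring
+- **Algorithm execution rate** - executions per second
+- **Algorithm execution latency** - P50/P95/P99 latency
+- **Total algorithm executions** - cumulative execution count
+
+#### Async job monitoring
+- **Total jobs created** - cumulative number of jobs created
+- **Total jobs completed** - cumulative number of jobs completed
+- **Total jobs failed** - cumulative number of jobs failed
+- **Job success rate** - success-rate gauge
+- **Async job rate** - creation and completion rates over time
+- **Async job execution latency** - P50/P95/P99 latency
+- **Job status distribution** - pie chart by status
+- **Webhook delivery status** - success/failure distribution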
+
+## Alerting Configuration
+
+### Alert rules
+
+Alert rules are defined in `monitoring/alerts/rules.yaml`:
+
+```yaml
+groups:
+  - name: functional_scaffold_alerts
+    interval: 30s
+    rules:
+      # High error rate: more than 0.05 error requests per second on a series
+      - alert: HighErrorRate
+        expr: rate(http_requests_total{status="error"}[5m]) > 0.05
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate detected"
+          description: "Endpoint {{ $labels.endpoint }} is returning errors at {{ $value }} requests/second"
+
+      # High latency
+      - alert: HighLatency
+        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High latency detected"
+          description: "P95 latency for endpoint {{ $labels.endpoint }} is {{ $value }}s"
+
+      # Service unavailable
+      - alert: ServiceDown
+        expr: up{job="functional-scaffold"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service unavailable"
+          description: "The FunctionalScaffold service has been down for more than 1 minute"
+
+      # Async job failure ratio.
+      # Both sides are aggregated by algorithm so their label sets match;
+      # otherwise the status label limits the denominator to failed jobs
+      # and the ratio is always 1.
+      - alert: HighJobFailureRate
+        expr: sum by (algorithm) (rate(jobs_completed_total{status="failed"}[5m])) / sum by (algorithm) (rate(jobs_completed_total[5m])) > 0.1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Async job failure rate too high"
+          description: "The async job failure rate for algorithm {{ $labels.algorithm }} exceeds 10%"
+
+      # Job backlog
+      - alert: JobBacklog
+        expr: sum(rate(jobs_created_total[5m])) - sum(rate(jobs_completed_total[5m])) > 10
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Async job backlog"
+          description: "The job creation rate exceeds the completion rate; a backlog may be building up"
+```
+
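+Each rule above carries a `for` clause: the alert is pending while its expression holds and only fires once the expression has held continuously for that duration. The sketch below models this state machine for a single series. The function name `alert_states` and the latency samples are hypothetical, and real Prometheus evaluation has further details that are not modeled here.
+
+```python
+def alert_states(samples, threshold, for_seconds):
+    """Return the alert state at each (timestamp, value) evaluation."""
+    states = []
+    active_since = None  # time at which the condition first became true
+    for ts, value in samples:
+        if value > threshold:
+            if active_since is None:
+                active_since = ts
+            held_for = ts - active_since
+            states.append("firing" if held_for >= for_seconds else "pending")
+        else:
+            # Any evaluation below the threshold resets the timer.
+            active_since = None
+            states.append("inactive")
+    return states
+
+
+# Hypothetical P95 latencies (seconds), evaluated every 60 s against
+# a rule shaped like HighLatency: value > 1, for: 5m.
+samples = [
+    (0, 0.4), (60, 1.2), (120, 1.3), (180, 0.9), (240, 1.5),
+    (300, 1.6), (360, 1.4), (420, 1.8), (480, 1.7), (540, 1.9),
+]
+states = alert_states(samples, threshold=1.0, for_seconds=300)
+for (ts, value), state in zip(samples, states):
+    print(ts, value, state)
+```
+
+The dip at 180 s resets the timer, so the alert only fires at 540 s, five minutes after the condition became true again at 240 s. A short recovery in the middle of an incident therefore delays the alert by a full `for` period.
+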
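+### Alert levels
+
+| Level | Description | Response time |
+|-------|-------------|---------------|
+| critical | Critical alert; the service is unavailable | Respond immediately |
+| warning | Warning; degraded performance or abnormal behavior | Respond within 1 hour |
+| info | Informational; worth keeping an eye on | Respond during working hours |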
+
+## Custom Metrics
+
+### Adding a new metric
+
+1. Define it in `config/metrics.yaml`:
+
+```yaml
+custom_metrics:
+  my_custom_counter:
+    name: "my_custom_counter"
+    type: counter
+    description: "My custom counter"
+    labels: [label1, label2]
+
+  my_custom_histogram:
+    name: "my_custom_histogram"
+    type: histogram
+    description: "My custom histogram"
+    labels: [label1]
+    buckets: [0.1, 0.5, 1, 5, 10]
+```
+
+2. Use it in code:
+
+```python
+from functional_scaffold.core.metrics_unified import incr, observe
+
+# Increment the counter
+incr("my_custom_counter", {"label1": "value1", "label2": "value2"})
+
+# Record an observation in the histogram
+observe("my_custom_histogram", {"label1": "value1"}, 0.5)
+```
+
+## Troubleshooting
+
+### Metrics are not showing up
+
+1. Check the application's metrics endpoint:
+   ```bash
+   curl http://localhost:8000/metrics
+   ```
+
+2. Check the Redis connection:
+   ```bash
+   redis-cli ping
+   ```
+
+3. Check the Prometheus scrape status:
+   - Open http://localhost:9090/targets
+   - Confirm that the functional-scaffold target is UP
+
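+When the endpoint responds but a metric still seems to be missing, it can help to parse the `/metrics` output and check which expected metric names are present. The sketch below is a deliberately simple parser for the Prometheus text format, run against a hard-coded sample. The sample text, its endpoint paths, and the `expected` set are assumptions for illustration, and the parser does not handle escaped quotes inside label values.
+
+```python
+import re
+
+SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
+LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')
+
+
+def parse_metrics(text):
+    """Parse exposition text into a list of (name, labels, value) tuples."""
+    samples = []
+    for line in text.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue  # skip blank lines and HELP/TYPE comments
+        match = SAMPLE_RE.match(line)
+        if match is None:
+            continue
+        name, raw_labels, value = match.groups()
+        labels = dict(LABEL_RE.findall(raw_labels or ""))
+        samples.append((name, labels, float(value)))
+    return samples
+
+
+# Hypothetical output; in practice this text would come from the /metrics endpoint.
+SAMPLE_TEXT = """\
+# HELP http_requests_total Total HTTP requests
+# TYPE http_requests_total counter
+http_requests_total{method="GET",endpoint="/healthz",status="success"} 12
+http_requests_total{method="POST",endpoint="/invoke",status="error"} 3
+http_requests_in_progress 1
+"""
+
+samples = parse_metrics(SAMPLE_TEXT)
+names = {name for name, _, _ in samples}
+expected = {"http_requests_total", "http_requests_in_progress", "jobs_created_total"}
+missing = expected - names
+print("missing metrics:", sorted(missing))
+```
+
+A metric that is listed as missing here may simply not have been recorded yet, so generating some traffic before checking is worthwhile.
+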
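+### Grafana shows no data
+
+1. Check the data source configuration:
+   - The URL should be `http://prometheus:9090` (the address inside the container network)
+   - Not `http://localhost:9090`
+
+2. Check the time range:
+   - Make sure the correct time range is selected
+   - Try "Last 5 minutes"
+
+3. Generate test traffic: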
+   ```bash
+   ./scripts/generate_traffic.sh
+   ```
+
+### Alerts are not firing
+
+1. Check that Prometheus has loaded the rules:
+   - Open http://localhost:9090/rules
+   - Confirm that the rules are loaded
+
+2. Check the alert state:
+   - Open http://localhost:9090/alerts
+   - See whether the alert is in the pending or firing state
+
+## References
+
+- [Prometheus documentation](https://prometheus.io/docs/)
+- [Grafana documentation](https://grafana.com/docs/)
+- [PromQL query language](https://prometheus.io/docs/prometheus/latest/querying/basics/)