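main: Remove redundant documentation and clean up the project directory
Changes:
- Remove redundant documents, including the Grafana guide, the metrics comparison, the fix summary, and the OpenAPI specification.
- Streamline the documentation structure and tighten the README.
- Make the documentation hierarchy clearer and consolidate the core guides.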
337
docs/monitoring.md
Normal file
@@ -0,0 +1,337 @@
# Monitoring Guide

This document describes the FunctionalScaffold monitoring stack: metrics collection, visualization, and alerting configuration.

## Monitoring Architecture

```
┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│ App instance 1  │       │ App instance 2  │       │ App instance N  │
│/metrics endpoint│       │/metrics endpoint│       │/metrics endpoint│
└────────┬────────┘       └────────┬────────┘       └────────┬────────┘
         │                         │                         │
         │ write metrics to Redis  │                         │
         └─────────────────────────┼─────────────────────────┘
                                   │
                                   ▼
                      ┌─────────────────────────┐
                      │          Redis          │
                      │  (aggregated metrics)   │
                      └────────────┬────────────┘
                                   │
                                   │ read and export
                                   ▼
                      ┌─────────────────────────┐
                      │       Prometheus        │
                      │   (scrapes /metrics)    │
                      └────────────┬────────────┘
                                   │
                                   │ query
                                   ▼
                      ┌─────────────────────────┐
                      │         Grafana         │
                      │     (visualization)     │
                      └─────────────────────────┘
```

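The aggregation step in the diagram can be illustrated with a minimal sketch. It assumes that every instance increments the same shared key and that `/metrics` is rendered from the shared values. A plain dictionary stands in for Redis here, and the names `record_increment`, `series_key`, and `render_metrics` are hypothetical, not taken from the project.

```python
from collections import defaultdict

# Stand-in for the shared Redis store (assumption: one float value per series key).
store = defaultdict(float)


def series_key(name, labels):
    """Build a stable key such as http_requests_total{endpoint="/run",status="success"}."""
    if not labels:
        return name
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}"


def record_increment(name, labels, amount=1.0):
    """Every instance increments the same shared key, so totals are aggregated."""
    store[series_key(name, labels)] += amount


def render_metrics():
    """Render the shared values as Prometheus text exposition lines."""
    return "\n".join(f"{key} {value}" for key, value in sorted(store.items())) + "\n"


# Two different instances record a request for the same (hypothetical) endpoint.
record_increment("http_requests_total", {"endpoint": "/run", "status": "success"})
record_increment("http_requests_total", {"endpoint": "/run", "status": "success"})
print(render_metrics())
```

Because the values live in one shared store, whichever instance Prometheus scrapes would report the same totals under this assumption.
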
## Quick Start

### Start the Monitoring Services

```bash
cd deployment
docker-compose up -d redis prometheus grafana
```

### Service URLs

| Service | URL | Default credentials |
|---------|-----|---------------------|
| Application metrics | http://localhost:8000/metrics | - |
| Prometheus | http://localhost:9090 | - |
| Grafana | http://localhost:3000 | admin/admin |

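To confirm that the metrics endpoint returns usable data, the text it serves can be parsed into series and values. The sketch below is a simplified parser that assumes no spaces inside label values and no trailing timestamps. The sample text and the `/healthz` endpoint in it are made up for illustration.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: assumes no spaces inside label values and no timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hand-written sample, not real output from the application.
sample = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/healthz",status="success"} 12
http_requests_in_progress 1
"""
parsed = parse_metrics(sample)
print(parsed)
```

Against a running stack, the input text could be fetched with `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`.
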
## Metrics Reference

### HTTP Request Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, endpoint, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, endpoint | HTTP request latency distribution |
| `http_requests_in_progress` | Gauge | - | Number of requests currently in progress |

### Algorithm Execution Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `algorithm_executions_total` | Counter | algorithm, status | Total number of algorithm executions |
| `algorithm_execution_duration_seconds` | Histogram | algorithm | Algorithm execution latency distribution |

### Async Job Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `jobs_created_total` | Counter | algorithm | Total number of jobs created |
| `jobs_completed_total` | Counter | algorithm, status | Total number of jobs completed |
| `job_execution_duration_seconds` | Histogram | algorithm | Job execution time distribution |
| `webhook_deliveries_total` | Counter | status | Total number of webhook deliveries |

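The Histogram type in these tables is worth a short illustration, because each histogram is exposed as cumulative `_bucket` series plus `_sum` and `_count`. The class below is a minimal sketch of that bookkeeping. It is not the project's implementation, and the bucket bounds are arbitrary.

```python
import math


class Histogram:
    """Minimal cumulative histogram mirroring the _bucket/_sum/_count series."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]  # the last bucket is le="+Inf"
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Buckets are cumulative: a value counts toward every bucket whose bound is >= value.
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.bucket_counts[i] += 1
        self.total += value
        self.count += 1


h = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.7, 2.0):
    h.observe(seconds)
print(h.bucket_counts, h.count)
```

The `+Inf` bucket always equals the total count, which is why quantile estimates can be derived from the bucket series alone.
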
## Prometheus Query Examples

### Basic Queries

```promql
# Requests per second (QPS), one result per series
rate(http_requests_total[5m])

# QPS grouped by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Request success rate
sum(rate(http_requests_total{status="success"}[5m]))
  / sum(rate(http_requests_total[5m]))

# Current number of in-flight requests
http_requests_in_progress
```

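The `rate()` calls above turn an ever-increasing counter into a per-second value. The following sketch shows the core idea under simplifying assumptions: it sums the increases between consecutive samples and treats a drop as a counter reset. Unlike Prometheus, it does not extrapolate to the edges of the window, so its results only approximate what `rate()` returns.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter from (timestamp, value) samples.

    Handles counter resets; unlike Prometheus it does not extrapolate to the window edges.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole new value is an increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Samples scraped every 15 s; the process restarts before the last scrape.
window = [(0, 100), (15, 130), (30, 160), (45, 20)]
print(simple_rate(window))
```
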
### Latency Analysis

```promql
# P50 latency
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Average latency
rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])
```

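The percentile queries above estimate a quantile from bucket counts rather than from raw observations. The sketch below follows the linear-interpolation idea behind `histogram_quantile`, with edge cases simplified. The bucket bounds and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Sketch of the linear interpolation used by PromQL; edge cases are simplified.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95)
```

The estimate is only as precise as the bucket layout: a quantile that falls inside a wide bucket is interpolated, not measured.
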
### Algorithm Analysis

```promql
# Algorithm execution rate
sum(rate(algorithm_executions_total[5m])) by (algorithm)

# Algorithm failure rate
sum(rate(algorithm_executions_total{status="error"}[5m]))
  / sum(rate(algorithm_executions_total[5m]))

# Algorithm P95 latency
histogram_quantile(0.95,
  sum(rate(algorithm_execution_duration_seconds_bucket[5m])) by (le, algorithm)
)
```

### Async Job Analysis

```promql
# Job creation rate
sum(rate(jobs_created_total[5m])) by (algorithm)

# Job success rate
sum(rate(jobs_completed_total{status="completed"}[5m]))
  / sum(rate(jobs_completed_total[5m]))

# Job backlog (creation rate minus completion rate)
sum(rate(jobs_created_total[5m])) - sum(rate(jobs_completed_total[5m]))

# Webhook success rate
sum(rate(webhook_deliveries_total{status="success"}[5m]))
  / sum(rate(webhook_deliveries_total[5m]))
```

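Any of these queries can also be run outside the web UI through the Prometheus HTTP API (`/api/v1/query`). The sketch below builds the request URL and extracts values from an instant-vector response. The helper names are hypothetical, and the sample payload is hand-written in the shape the API returns, not real output.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # Prometheus address used in this guide


def build_query_url(expr):
    """Return the instant-query URL for a PromQL expression."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': expr})}"


def extract_values(payload):
    """Return (labels, value) pairs from an instant-vector response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return [(item["metric"], float(item["value"][1])) for item in body["data"]["result"]]


url = build_query_url("sum(rate(jobs_created_total[5m])) by (algorithm)")

# Hand-written sample response; "demo" is a made-up algorithm name.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"algorithm": "demo"}, "value": [1700000000, "0.25"]}]},
})
print(url)
print(extract_values(sample))
```
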
## Grafana Dashboards

### Import the Dashboard

1. Open Grafana: http://localhost:3000
2. Log in (admin/admin)
3. Go to **Dashboards** → **Import**
4. Upload the file `monitoring/grafana/dashboard.json`
5. Select the Prometheus data source
6. Click **Import**

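Before importing, it can help to check which panels a dashboard file defines. The sketch below assumes the usual Grafana dashboard JSON layout, with a top-level `panels` list in which row panels may nest further panels. It runs on a tiny hand-written stand-in rather than the real `monitoring/grafana/dashboard.json`.

```python
import json


def panel_titles(dashboard):
    """Collect panel titles, including panels nested inside row panels."""
    titles = []
    for panel in dashboard.get("panels", []):
        titles.append(panel.get("title", "<untitled>"))
        titles.extend(panel_titles(panel))  # collapsed rows keep children under "panels"
    return titles


# Hand-written stand-in for a dashboard file; the titles are illustrative.
sample = json.loads("""
{"title": "Example",
 "panels": [
   {"type": "row", "title": "HTTP", "panels": [{"type": "graph", "title": "QPS"}]},
   {"type": "stat", "title": "Job success rate"}
 ]}
""")
print(panel_titles(sample))
```
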
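### Dashboard Panels

#### HTTP Monitoring

- **HTTP request rate (QPS)** - requests per second over time
- **HTTP request latency** - P50/P95/P99 latency over time
- **Request success rate** - success-rate gauge
- **Current in-flight requests** - live concurrency
- **Total HTTP requests** - cumulative request count
- **Request distribution** - pie chart by endpoint and status

#### Algorithm Monitoring

- **Algorithm execution rate** - executions per second
- **Algorithm execution latency** - P50/P95/P99 latency
- **Total algorithm executions** - cumulative execution count

#### Async Job Monitoring

- **Total jobs created** - cumulative number of jobs created
- **Total jobs completed** - cumulative number of jobs completed
- **Total jobs failed** - cumulative number of jobs failed
- **Job success rate** - success-rate gauge
- **Async job rate** - creation and completion rates over time
- **Async job execution latency** - P50/P95/P99 latency
- **Job status distribution** - pie chart by status
- **Webhook delivery status** - success/failure distribution
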
## Alerting Configuration

### Alert Rules

Alert rules are defined in `monitoring/alerts/rules.yaml`:

```yaml
groups:
  - name: functional_scaffold_alerts
    interval: 30s
    rules:
      # High error rate (more than 0.05 failed requests per second)
      - alert: HighErrorRate
        expr: rate(http_requests_total{status="error"}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate for endpoint {{ $labels.endpoint }} is {{ $value }} requests/s"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency for endpoint {{ $labels.endpoint }} is {{ $value }}s"

      # Service unavailable
      - alert: ServiceDown
        expr: up{job="functional-scaffold"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable"
          description: "The FunctionalScaffold service has been down for more than 1 minute"

      # High async job failure rate
      # Aggregate by algorithm so both sides of the division carry the same labels;
      # without it, the status="failed" series would only match themselves.
      - alert: HighJobFailureRate
        expr: sum by (algorithm) (rate(jobs_completed_total{status="failed"}[5m])) / sum by (algorithm) (rate(jobs_completed_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Async job failure rate too high"
          description: "Async job failure rate for algorithm {{ $labels.algorithm }} exceeds 10%"

      # Job backlog
      - alert: JobBacklog
        expr: sum(rate(jobs_created_total[5m])) - sum(rate(jobs_completed_total[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Async job backlog"
          description: "Jobs are being created faster than they complete, so a backlog may be building up"
```

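Each rule above combines an expression with a `for` duration: the alert stays pending until the expression has held continuously for that long, and only then fires. The sketch below models that state machine for a single series. The evaluation timestamps are invented, and the function name is hypothetical.

```python
def alert_states(condition_by_time, for_seconds):
    """Return (timestamp, state) pairs for a rule with a `for` duration.

    condition_by_time: list of (timestamp, bool) evaluation results in time order.
    """
    states = []
    active_since = None
    for ts, is_true in condition_by_time:
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append((ts, "inactive"))
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append((ts, "firing" if held >= for_seconds else "pending"))
    return states


# Evaluated every 30 s with a 60 s hold, matching `interval: 30s` and `for: 1m`.
evaluations = [(0, True), (30, True), (60, True), (90, False), (120, True)]
print(alert_states(evaluations, for_seconds=60))
```

A single false evaluation returns the alert to inactive, which is why short dips below a threshold can keep a rule with a long `for` from ever firing.
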
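### Alert Severity Levels

| Severity | Description | Response time |
|----------|-------------|---------------|
| critical | Severe alert; the service is unavailable | Respond immediately |
| warning | Degraded performance or abnormal behavior | Respond within 1 hour |
| info | Informational; needs attention | Respond during working hours |
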
## Custom Metrics

### Add a New Metric

1. Define it in `config/metrics.yaml`:

```yaml
custom_metrics:
  my_custom_counter:
    name: "my_custom_counter"
    type: counter
    description: "My custom counter"
    labels: [label1, label2]

  my_custom_histogram:
    name: "my_custom_histogram"
    type: histogram
    description: "My custom histogram"
    labels: [label1]
    buckets: [0.1, 0.5, 1, 5, 10]
```

2. Use it in code:

```python
from functional_scaffold.core.metrics_unified import incr, observe

# Increment the counter
incr("my_custom_counter", {"label1": "value1", "label2": "value2"})

# Record a histogram observation
observe("my_custom_histogram", {"label1": "value1"}, 0.5)
```

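Histogram metrics are most often fed with measured durations, so a small timing helper can be convenient. The sketch below is a hypothetical context manager, not part of the project. It takes the reporting function as a parameter and uses a stand-in with the same `(name, labels, value)` argument order as the `observe()` call shown in step 2, so it runs without the project installed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(observe_fn, metric, labels):
    """Time the wrapped block and report the duration in seconds via observe_fn."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed operations are measured too.
        observe_fn(metric, labels, time.perf_counter() - start)


recorded = []


def fake_observe(name, labels, value):
    """Stand-in for the project's observe(); it only records the call."""
    recorded.append((name, labels, value))


with timed(fake_observe, "my_custom_histogram", {"label1": "value1"}):
    sum(range(1000))  # the work being measured

print(recorded)
```

In application code, the real `observe` would presumably be passed in place of `fake_observe`.
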
## Troubleshooting

### Metrics Are Missing

1. Check the application metrics endpoint:
   ```bash
   curl http://localhost:8000/metrics
   ```

2. Check the Redis connection:
   ```bash
   redis-cli ping
   ```

3. Check the Prometheus scrape status:
   - Open http://localhost:9090/targets
   - Confirm that the functional-scaffold target is UP

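The target check in step 3 can also be scripted against the Prometheus HTTP API (`/api/v1/targets`). The sketch below filters a response for targets of the `functional-scaffold` job that are not healthy. The payload is hand-written in the shape the API returns, and the `http://app:8000` address is a made-up example.

```python
import json


def unhealthy_targets(payload, job="functional-scaffold"):
    """List (scrape URL, last error) for targets of `job` that are not up."""
    targets = json.loads(payload)["data"]["activeTargets"]
    return [
        (t.get("scrapeUrl"), t.get("lastError"))
        for t in targets
        if t.get("labels", {}).get("job") == job and t.get("health") != "up"
    ]


# Hand-written sample in the shape of the /api/v1/targets response, not real output.
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "functional-scaffold"}, "scrapeUrl": "http://app:8000/metrics",
         "health": "down", "lastError": "connection refused"},
        {"labels": {"job": "prometheus"}, "scrapeUrl": "http://localhost:9090/metrics",
         "health": "up", "lastError": ""},
    ]},
})
print(unhealthy_targets(sample))
```

An empty result means every target of the job is being scraped successfully.
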
### Grafana Shows No Data

1. Check the data source configuration:
   - The URL should be `http://prometheus:9090` (the address inside the container network)
   - It should not be `http://localhost:9090`

2. Check the time range:
   - Make sure the selected time range is correct
   - Try "Last 5 minutes"

3. Generate test traffic:
   ```bash
   ./scripts/generate_traffic.sh
   ```

### Alerts Do Not Fire

1. Check that Prometheus has loaded the rules:
   - Open http://localhost:9090/rules
   - Confirm that the rules are listed

2. Check the alert state:
   - Open http://localhost:9090/alerts
   - See whether the alert is in the pending or firing state

## References

- [Prometheus documentation](https://prometheus.io/docs/)
- [Grafana documentation](https://grafana.com/docs/)
- [PromQL query language](https://prometheus.io/docs/prometheus/latest/querying/basics/)