- Implemented policy_utils.py with helper functions for action selection, including epsilon-greedy support.

- Updated `requirements.txt` to relax the PyTorch version constraint for better GPU compatibility.
- Added detailed GPU setup instructions, new device fallback options, and command examples to `README.md`.
- Developed a new script `plot_model_max_x_trend.py` for visualizing training trends, generating HTML/Markdown reports.
2026-02-13 16:11:38 +08:00
parent 71008dfb72
commit 2960ac1df5
11 changed files with 1294 additions and 32 deletions
--- a/mario-rl-mvp/README.md
+++ b/mario-rl-mvp/README.md
@@ -11,6 +11,7 @@ mario-rl-mvp/
    train_ppo.py
    eval.py
    record_video.py
+    plot_model_max_x_trend.py
    utils.py
  artifacts/
    models/
@@ -32,6 +33,44 @@ python -m pip install --upgrade pip setuptools wheel
 pip install -r requirements.txt
 ```

+## 2.1 Environment setup (WSL / Ubuntu)
+
+If the system Python lacks `venv`/`pip`, the recommended approach is to create the environment and install the dependencies with `uv`:
+
+```bash
+cd /home/roog/super-mario/mario-rl-mvp
+uv venv .venv -p /usr/bin/python3.10
+uv pip install --python .venv/bin/python -r requirements.txt
+```
+
+If you prefer the system `venv`, install it first:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y python3.10-venv python3-pip
+```
+
+### Notes for RTX 50-series GPUs (e.g. RTX 5080)
+
+If you see something like:
+
+```text
+... CUDA capability sm_120 is not compatible with the current PyTorch installation ...
+```
+
+This means the current torch wheel does not include `sm_120` kernels. You can upgrade directly to the `cu128` nightly build:
+
+```bash
+cd /home/roog/super-mario/mario-rl-mvp
+uv pip install --python .venv/bin/python --upgrade --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
+```
+
+Verify the GPU:
+
+```bash
+.venv/bin/python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'); print(torch.cuda.get_device_capability(0) if torch.cuda.is_available() else 'N/A')"
+```
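
The `sm_120` error above comes down to whether the installed wheel was compiled for the GPU's compute capability. A minimal sketch of that check follows. The helper names `capability_to_arch` and `wheel_supports` are hypothetical, and the arch list is an illustrative value; in a real environment the inputs could come from `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability(0)`. It is only an exact-tag heuristic and ignores PTX forward compatibility.

```python
def capability_to_arch(capability):
    """Format a (major, minor) compute capability as an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_supports(arch_list, capability):
    """Return True if the wheel's arch list names this capability exactly.

    Heuristic only: it does not model PTX forward compatibility.
    """
    return capability_to_arch(capability) in arch_list


if __name__ == "__main__":
    # Hypothetical arch list for an older wheel (illustrative values only).
    old_wheel = ["sm_70", "sm_80", "sm_86", "sm_90"]
    print(capability_to_arch((12, 0)))                      # sm_120
    print(wheel_supports(old_wheel, (12, 0)))               # False
    print(wheel_supports(old_wheel + ["sm_120"], (12, 0)))  # True
```

If the check returns `False` for your GPU, that is consistent with the error message and with needing a newer wheel such as the `cu128` nightly.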
+
 Optional system dependencies (for ffmpeg transcoding and potential SDL compatibility):

 ```bash
@@ -40,12 +79,19 @@ brew install ffmpeg sdl2

 ## 3. Start training with one command

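-Trains on CPU by default (if a usable and stable MPS device is detected, it is tried automatically; otherwise training falls back to CPU):
+Trains with `--device auto` by default (CUDA first, then MPS, then CPU):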

 ```bash
 python -m src.train_ppo
 ```

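+When `--device cuda` or `--device mps` is specified explicitly and that device is unavailable, the script raises an error by default (to avoid silently falling back to CPU).
+If you explicitly accept the fallback, add: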
+
+```bash
+python -m src.train_ppo --device cuda --allow-device-fallback
+```
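
The device policy described above (`auto` prefers CUDA, then MPS, then CPU; an explicit request fails unless fallback is allowed) can be sketched as a pure function. This is an illustration under stated assumptions, not the code in `train_ppo.py`: the name `resolve_device` and its parameters are hypothetical, and availability is passed in as booleans so the sketch runs without PyTorch installed.

```python
def resolve_device(requested, cuda_available, mps_available, allow_fallback=False):
    """Pick a device name following the README's stated policy.

    requested: "auto", "cuda", "mps", or "cpu" (hypothetical argument values).
    """
    available = {"cuda": cuda_available, "mps": mps_available, "cpu": True}

    if requested == "auto":
        # Preference order from the README: CUDA, then MPS, then CPU.
        for name in ("cuda", "mps", "cpu"):
            if available[name]:
                return name

    if requested not in available:
        raise ValueError(f"Unknown device: {requested!r}")

    if available[requested]:
        return requested

    if allow_fallback:
        # The user opted in, so fall back to CPU instead of failing.
        return "cpu"

    raise RuntimeError(
        f"Requested device {requested!r} is unavailable; "
        "pass --allow-device-fallback to fall back to CPU."
    )


if __name__ == "__main__":
    print(resolve_device("auto", cuda_available=False, mps_available=True))
    print(resolve_device("cuda", False, False, allow_fallback=True))
```

Failing loudly on an explicit request keeps a long training run from silently using the CPU when a GPU was expected.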
+
 Common override parameters:

 ```bash
@@ -78,6 +124,30 @@ python -m src.train_ppo \
  --total-timesteps 300000
 ```

+My current parameters:
+
+```
+python -m src.train_ppo \
+  --init-model-path artifacts/models/latest_model.zip \
+  --n-envs 16 \
+  --allow-partial-init \
+  --reward-mode progress \
+  --movement simple \
+  --ent-coef 0.001 \
+  --learning-rate 2e-5 \
+  --n-steps 2048 \
+  --gamma 0.99 \
+  --death-penalty -50 \
+  --stall-penalty 0.05 \
+  --stall-steps 40 \
+  --backward-penalty-scale 0.01 \
+  --milestone-bonus 2.0 \
+  --no-progress-terminate-steps 300 \
+  --no-progress-terminate-penalty 10 \
+  --time-penalty -0.01 \
+  --total-timesteps 1200000
+```
+
 ### 3.1 Continue training from an existing model (`--init-model-path`)

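 - Purpose: load existing `.zip` weights and continue training; suited to keeping the experiment goal unchanged while adjusting exploration parameters.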
@@ -143,13 +213,35 @@ tensorboard --logdir artifacts/logs --port 6006
 Load the latest model, run N episodes, and print the average metrics:

 ```bash
-python -m src.eval --episodes 5 --stochastic
+python -m src.eval \
+  --model-path artifacts/models/latest_model.zip \
+  --episodes 20 \
+  --movement simple \
+  --reward-mode progress \
+  --no-progress-terminate-steps 300 \
+  --no-progress-terminate-penalty 10 \
+  --death-penalty -50 \
+  --stall-penalty 0.05 \
+  --stall-steps 40 \
+  --time-penalty -0.01 \
+  --epsilon 0.08
 ```
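
The `--epsilon 0.08` flag above corresponds to the epsilon-greedy support mentioned in the commit message. A minimal sketch of the general technique follows: with probability epsilon a uniformly random action replaces the policy's action. The function name `epsilon_greedy`, its parameters, and the action count of 7 are hypothetical illustrations, not the contents of `policy_utils.py`.

```python
import random


def epsilon_greedy(policy_action, n_actions, epsilon, rng=random):
    """Return a uniformly random action with probability epsilon,
    otherwise return the action chosen by the policy."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be within [0, 1]")
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return policy_action


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same sequence
    # 7 is an illustrative action count; 3 stands in for the policy's choice.
    actions = [epsilon_greedy(3, 7, 0.08, rng) for _ in range(1000)]
    overridden = sum(a != 3 for a in actions)
    print(f"{overridden} of 1000 actions differed from the policy's choice")
```

A small epsilon during evaluation can help a mostly deterministic policy escape spots where it would otherwise repeat the same action indefinitely.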

 You can specify a model:

 ```bash
-python -m src.eval --model-path artifacts/models/latest_model.zip --episodes 10 --stochastic
+python -m src.eval \
+  --model-path artifacts/models/latest_model.zip \
+  --episodes 20 \
+  --movement simple \
+  --reward-mode progress \
+  --no-progress-terminate-steps 300 \
+  --no-progress-terminate-penalty 10 \
+  --death-penalty -50 \
+  --stall-penalty 0.05 \
+  --stall-steps 40 \
+  --time-penalty -0.01 \
+  --stochastic
 ```

 Note: `eval.py` defaults to `--movement auto`, which matches `right_only/simple` automatically from the model's action dimension, avoiding a `KeyError` caused by mismatched action spaces.
@@ -174,13 +266,41 @@ _total_timesteps = 150000
 By default, records a roughly 10-second mp4 to `artifacts/videos/`:

 ```bash
-python -m src.record_video --duration-sec 10 --fps 30 --stochastic
+python -m src.record_video \
+  --model-path artifacts/models/latest_model.zip \
+  --movement simple \
+  --reward-mode progress \
+  --no-progress-terminate-steps 300 \
+  --no-progress-terminate-penalty 10 \
+  --death-penalty -50 \
+  --stall-penalty 0.05 \
+  --stall-steps 40 \
+  --time-penalty -0.01 \
+  --epsilon 0.08 \
+  --duration-sec 30
 ```

-You can specify the output path:
+Or the stable variant:

 ```bash
-python -m src.record_video --output artifacts/videos/demo.mp4 --stochastic --duration-sec 10
+python -m src.record_video \
+  --model-path artifacts/models/latest_model.zip \
+  --movement simple \
+  --reward-mode progress \
+  --no-progress-terminate-steps 300 \
+  --no-progress-terminate-penalty 10 \
+  --death-penalty -50 \
+  --stall-penalty 0.05 \
+  --stall-steps 40 \
+  --time-penalty -0.01 \
+  --epsilon 0.08 \
+  --epsilon-random-mode uniform \
+  --max-steps 6000
+```
+Optional:
+
+```bash
+--output artifacts/videos/mario_eps008.mp4
 ```

 Note: `record_video.py` defaults to `--movement auto`, which matches the action space to the model automatically.
@@ -190,6 +310,47 @@ python -m src.record_video --output artifacts/videos/demo.mp4 --stochastic --dur
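 - Writes mp4 via `imageio + ffmpeg` by default
 - If writing the mp4 fails, it automatically falls back to saving a frame sequence (PNG) and prints the ffmpeg transcoding command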

+## 5.1 Model trend visualization (HTML / Markdown)
+
+Visualizes how key metrics of the models in `artifacts/models/` trend over the course of training, and produces a Chinese-language HTML or Markdown report.
+
+Default command:
+
+```bash
+python -m src.plot_model_max_x_trend
+```
+
+Default output:
+
+- `artifacts/reports/model_max_x_trend.html`
+
+Output a Markdown report:
+
+```bash
+python -m src.plot_model_max_x_trend --format markdown
+```
+
+Default Markdown output:
+
+- `artifacts/reports/model_max_x_trend.md`
+
+Optional parameters (custom directories/output):
+
+```bash
+uv run python -m src.plot_model_max_x_trend \
+  --models-dir artifacts/models \
+  --logs-dir artifacts/logs \
+  --format markdown \
+  --output artifacts/reports/model_max_x_trend.md
+```
+
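+Report contents:
+
+- Main trend: `max_x` (maximum forward distance)
+- Multi-metric trends: mean return, mean episode length in steps, level-clear rate, no-progress termination rate, death termination rate, timeout termination rate, hard-stuck termination rate
+- Per-model detail table: the metric values, matched step, and source TensorBoard tag for each checkpoint/final model
+- Glossary: explanations of terms such as Run, Checkpoint, model_step, matched_step, and TensorBoard Tag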
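
One plausible reading of `model_step` and `matched_step` is that a checkpoint's step is read from its filename and then paired with the nearest step logged in TensorBoard. The sketch below illustrates that reading only; it is an assumption, not the logic in `plot_model_max_x_trend.py`. The filename pattern is taken from the checkpoint name `ppo_mario_ckpt_100000_steps.zip` that appears elsewhere in this README, and the helper names and logged step values are hypothetical.

```python
import re

# Matches names such as "ppo_mario_ckpt_100000_steps.zip".
CKPT_PATTERN = re.compile(r"_ckpt_(\d+)_steps\.zip$")


def parse_model_step(filename):
    """Extract the training step from a checkpoint filename, or None."""
    match = CKPT_PATTERN.search(filename)
    return int(match.group(1)) if match else None


def nearest_logged_step(model_step, logged_steps):
    """Return the logged step closest to model_step (ties pick the smaller).

    Raises ValueError if logged_steps is empty.
    """
    return min(logged_steps, key=lambda step: (abs(step - model_step), step))


if __name__ == "__main__":
    step = parse_model_step("ppo_mario_ckpt_100000_steps.zip")
    print(step)                                        # 100000
    # Hypothetical logged steps, for illustration only.
    print(nearest_logged_step(step, [98304, 131072]))  # 98304
```

Files that do not follow the checkpoint pattern, such as `latest_model.zip`, yield `None` here and would need a different source for their step.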
+
 ## 6. Choosing the action space

 The default is `RIGHT_ONLY`, because:
@@ -231,22 +392,23 @@ python -m src.train_ppo --reward-mode clip

 ```bash
 python -m src.train_ppo \
-  --init-model-path artifacts/models/latest_model.zip \
+  --init-model-path  /home/roog/super-mario/mario-rl-mvp/artifacts/models/ppo_SuperMarioBros-1-1-v0_20260212_205220/ppo_mario_ckpt_100000_steps.zip \
+  --n-envs 16 \
  --allow-partial-init \
  --reward-mode progress \
  --movement simple \
-  --ent-coef 0.04 \
+  --ent-coef 0.01 \
  --learning-rate 1e-4 \
-  --n-steps 512 \
-  --gamma 0.995 \
-  --death-penalty -120 \
-  --stall-penalty 0.2 \
-  --stall-steps 20 \
-  --backward-penalty-scale 0.03 \
+  --n-steps 1024 \
+  --gamma 0.99 \
+  --death-penalty -50 \
+  --stall-penalty 0.05 \
+  --stall-steps 40 \
+  --backward-penalty-scale 0.01 \
  --milestone-bonus 2.0 \
-  --no-progress-terminate-steps 80 \
-  --no-progress-terminate-penalty 30 \
-  --total-timesteps 150000
+  --no-progress-terminate-steps 300 \
+  --no-progress-terminate-penalty 10 \
+  --total-timesteps 300000
 ```

 ## 8. Troubleshooting common problems
@@ -312,6 +474,30 @@ python -m src.train_ppo --init-model-path artifacts/models/latest_model.zip --mo

 3) Or skip loading the old model entirely and train from scratch with the new action space.

+### 8.6 `cudaGetDeviceCount ... Error 304` (CUDA initialization failure under WSL)
+
+If you see this when training starts:
+
+```text
+[device] cpu | CUDA unavailable, using CPU.
+[device_diag] ... torch.cuda.is_available()=False ... Error 304 ...
+```
+
+This means the problem is not a missing `device` argument; the CUDA runtime failed to initialize in the current environment.
+
+First, run these two checks:
+
+```bash
+nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
+.venv/bin/python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"
+```
+
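+A common cause is an abnormal WSL GPU stack or driver state rather than the PPO code itself. If you only need the experiment to run for now, select the CPU explicitly: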
+
+```bash
+python -m src.train_ppo --device cpu
+```
+
 ## 9. Minimal smoke test (run in order)

 ```bash