- Implemented `policy_utils.py` with helper functions for action selection, including epsilon-greedy support.
- Updated `requirements.txt` to relax the PyTorch version constraint for better GPU compatibility.
- Added detailed GPU setup instructions, new device fallback options, and command examples to `README.md`.
- Developed a new script, `plot_model_max_x_trend.py`, for visualizing training trends and generating HTML/Markdown reports.
@@ -11,6 +11,7 @@ mario-rl-mvp/
train_ppo.py
eval.py
record_video.py
plot_model_max_x_trend.py
utils.py
artifacts/
models/
@@ -32,6 +33,44 @@ python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
```

## 2.1 Environment setup (WSL / Ubuntu)

If the system Python lacks `venv`/`pip`, the recommended route is to create the environment and install the dependencies directly with `uv`:

```bash
cd /home/roog/super-mario/mario-rl-mvp
uv venv .venv -p /usr/bin/python3.10
uv pip install --python .venv/bin/python -r requirements.txt
```

If you prefer the system `venv`, install it first:

```bash
sudo apt-get update
sudo apt-get install -y python3.10-venv python3-pip
```

### RTX 50 series (e.g. RTX 5080) GPU notes

If you see a message like:

```text
... CUDA capability sm_120 is not compatible with the current PyTorch installation ...
```

then the installed torch wheel does not include `sm_120` kernels. You can upgrade directly to the `cu128` nightly build:

```bash
cd /home/roog/super-mario/mario-rl-mvp
uv pip install --python .venv/bin/python --upgrade --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
```

Verify the GPU:

```bash
.venv/bin/python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'); print(torch.cuda.get_device_capability(0) if torch.cuda.is_available() else 'N/A')"
```
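
The `sm_120` error comes down to whether the device's compute capability appears among the architectures the installed wheel was built for. The following is a minimal sketch of that comparison, not code from this repository: `capability_to_arch` and `wheel_supports` are hypothetical helper names, and the check deliberately ignores PTX forward compatibility.

```python
def capability_to_arch(capability):
    """Turn a (major, minor) capability tuple such as (12, 0) into 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_supports(capability, arch_list):
    """Return True if arch_list contains native kernels for this capability."""
    return capability_to_arch(capability) in arch_list


# Example with hand-written values (assumed, not measured on real hardware):
print(wheel_supports((12, 0), ["sm_80", "sm_86", "sm_90"]))
```

On a machine with a working CUDA setup, the two inputs could be taken from `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()`.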

Optional system dependencies (for ffmpeg transcoding and potential SDL compatibility):

```bash
@@ -40,12 +79,19 @@ brew install ffmpeg sdl2

## 3. Start training with a single command

Default CPU training (if an available and stable MPS device is detected, it is tried automatically; otherwise training falls back to CPU):
Default `--device auto` training (prefers CUDA, then MPS, then CPU):

```bash
python -m src.train_ppo
```

When `--device cuda` or `--device mps` is specified explicitly and that device is unavailable, the script raises an error by default (to avoid silently falling back to CPU).
If you explicitly accept the fallback, add:

```bash
python -m src.train_ppo --device cuda --allow-device-fallback
```
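
The device rules described here can be summarized as a small pure function. This is an illustrative sketch only; the real logic in `train_ppo.py` is not shown in this diff, and `resolve_device` with its boolean availability arguments is a hypothetical interface.

```python
def resolve_device(requested, cuda_ok, mps_ok, allow_fallback=False):
    """Pick a device name following the priority rules described in the README."""
    available = {"cuda": cuda_ok, "mps": mps_ok, "cpu": True}
    if requested == "auto":
        # Priority order: CUDA, then MPS, then CPU (CPU is always available).
        for name in ("cuda", "mps", "cpu"):
            if available[name]:
                return name
    if requested not in available:
        raise ValueError(f"unknown device: {requested!r}")
    if available[requested]:
        return requested
    if allow_fallback:
        # The caller explicitly accepted falling back to CPU.
        return "cpu"
    raise RuntimeError(
        f"{requested} was requested but is unavailable; "
        "pass --allow-device-fallback to use CPU instead"
    )
```

Raising by default keeps a misconfigured GPU from silently turning a training run into a much slower CPU run.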

Commonly overridden parameters:

```bash
@@ -78,6 +124,30 @@ python -m src.train_ppo \
  --total-timesteps 300000
```

My current parameters:

```bash
python -m src.train_ppo \
  --init-model-path artifacts/models/latest_model.zip \
  --n-envs 16 \
  --allow-partial-init \
  --reward-mode progress \
  --movement simple \
  --ent-coef 0.001 \
  --learning-rate 2e-5 \
  --n-steps 2048 \
  --gamma 0.99 \
  --death-penalty -50 \
  --stall-penalty 0.05 \
  --stall-steps 40 \
  --backward-penalty-scale 0.01 \
  --milestone-bonus 2.0 \
  --no-progress-terminate-steps 300 \
  --no-progress-terminate-penalty 10 \
  --time-penalty -0.01 \
  --total-timesteps 1200000
```
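
To make the two `--no-progress-terminate-*` flags more concrete, here is one way such a rule could work. This is a sketch under stated assumptions, not the repository's implementation: it assumes "progress" means reaching a new maximum x position, that the penalty is a magnitude subtracted from the reward, and `NoProgressTracker` is a hypothetical name.

```python
class NoProgressTracker:
    """Counts consecutive steps without a new best x position (hypothetical helper)."""

    def __init__(self, terminate_steps, terminate_penalty):
        self.terminate_steps = terminate_steps
        # Assumption: the flag value is a magnitude, applied as a negative reward.
        self.terminate_penalty = abs(terminate_penalty)
        self.best_x = None
        self.steps_without_progress = 0

    def update(self, x_pos):
        """Return (reward_adjustment, should_terminate) for the current step."""
        if self.best_x is None or x_pos > self.best_x:
            # New furthest position: reset the counter.
            self.best_x = x_pos
            self.steps_without_progress = 0
            return 0.0, False
        self.steps_without_progress += 1
        if self.steps_without_progress >= self.terminate_steps:
            return -self.terminate_penalty, True
        return 0.0, False
```

With the values from the command above, `NoProgressTracker(300, 10)` would end an episode after 300 consecutive steps without a new best position.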
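
### 3.1 Continue training from an existing model (`--init-model-path`)

- Purpose: load existing `.zip` weights and resume training; suitable when you want to "adjust exploration parameters without interrupting the experiment's goal".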
@@ -143,13 +213,35 @@ tensorboard --logdir artifacts/logs --port 6006
Load the latest model, run N episodes, and print the averaged metrics:

```bash
python -m src.eval --episodes 5 --stochastic
python -m src.eval \
  --model-path artifacts/models/latest_model.zip \
  --episodes 20 \
  --movement simple \
  --reward-mode progress \
  --no-progress-terminate-steps 300 \
  --no-progress-terminate-penalty 10 \
  --death-penalty -50 \
  --stall-penalty 0.05 \
  --stall-steps 40 \
  --time-penalty -0.01 \
  --epsilon 0.08
```

You can also specify a model:

```bash
python -m src.eval --model-path artifacts/models/latest_model.zip --episodes 10 --stochastic
python -m src.eval \
  --model-path artifacts/models/latest_model.zip \
  --episodes 20 \
  --movement simple \
  --reward-mode progress \
  --no-progress-terminate-steps 300 \
  --no-progress-terminate-penalty 10 \
  --death-penalty -50 \
  --stall-penalty 0.05 \
  --stall-steps 40 \
  --time-penalty -0.01 \
  --stochastic
```

Note: `eval.py` defaults to `--movement auto`, which picks `right_only/simple` to match the model's action dimension, avoiding a `KeyError` caused by mismatched action spaces.
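
As an illustration of what matching by action dimension could look like, the sketch below maps an action count to a movement name. It is not the code in `eval.py`: `infer_movement` is a hypothetical name, and the counts (5 for `RIGHT_ONLY`, 7 for `SIMPLE_MOVEMENT`) reflect the action lists in `gym-super-mario-bros` as I understand them, so it is worth confirming them with `len(RIGHT_ONLY)` and `len(SIMPLE_MOVEMENT)` in your environment.

```python
# Assumed action counts of the movement sets in gym-super-mario-bros.
ACTION_COUNT_TO_MOVEMENT = {5: "right_only", 7: "simple"}


def infer_movement(n_actions):
    """Map a model's discrete action count to a movement name."""
    try:
        return ACTION_COUNT_TO_MOVEMENT[n_actions]
    except KeyError:
        # Fail with a readable message instead of a bare KeyError.
        known = sorted(ACTION_COUNT_TO_MOVEMENT)
        raise ValueError(
            f"no movement set with {n_actions} actions; known sizes: {known}"
        ) from None
```

For a Stable-Baselines3 model with a discrete action space, the count would typically come from `model.action_space.n`.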
@@ -174,13 +266,41 @@ _total_timesteps = 150000
By default, records an mp4 of about 10 seconds to `artifacts/videos/`:

```bash
python -m src.record_video --duration-sec 10 --fps 30 --stochastic
python -m src.record_video \
  --model-path artifacts/models/latest_model.zip \
  --movement simple \
  --reward-mode progress \
  --no-progress-terminate-steps 300 \
  --no-progress-terminate-penalty 10 \
  --death-penalty -50 \
  --stall-penalty 0.05 \
  --stall-steps 40 \
  --time-penalty -0.01 \
  --epsilon 0.08 \
  --duration-sec 30
```

You can specify the output path:
Or use the stable version:

```bash
python -m src.record_video --output artifacts/videos/demo.mp4 --stochastic --duration-sec 10
python -m src.record_video \
  --model-path artifacts/models/latest_model.zip \
  --movement simple \
  --reward-mode progress \
  --no-progress-terminate-steps 300 \
  --no-progress-terminate-penalty 10 \
  --death-penalty -50 \
  --stall-penalty 0.05 \
  --stall-steps 40 \
  --time-penalty -0.01 \
  --epsilon 0.08 \
  --epsilon-random-mode uniform \
  --max-steps 6000
```
Optional:

```bash
  --output artifacts/videos/mario_eps008.mp4
```
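
The `--epsilon` and `--epsilon-random-mode uniform` flags suggest epsilon-greedy action selection with uniformly sampled random actions. A minimal sketch of that technique follows; it is not the contents of `policy_utils.py`, and the function name and signature are assumptions made for illustration.

```python
import random


def epsilon_greedy(policy_action, n_actions, epsilon, rng=random):
    """With probability epsilon return a uniformly random action, else the policy's."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        # Uniform exploration over all discrete actions.
        return rng.randrange(n_actions)
    return policy_action


# Example: with epsilon=0.08, roughly 8% of steps explore at random.
rng = random.Random(0)
actions = [epsilon_greedy(1, 7, 0.08, rng) for _ in range(1000)]
```

Passing a seeded `random.Random` instance keeps evaluation and recording runs reproducible.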

Note: `record_video.py` defaults to `--movement auto`, which matches the action space to the model automatically.
@@ -190,6 +310,47 @@ python -m src.record_video --output artifacts/videos/demo.mp4 --stochastic --dur
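- By default, the mp4 is written via `imageio + ffmpeg`
- If writing the mp4 fails, the script automatically falls back to saving a frame sequence (PNG) and prints an ffmpeg transcoding command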
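
For the PNG fallback, a transcoding command could be assembled along the following lines. This is a sketch, not the command the script actually prints: the `frame_%06d.png` naming pattern and the `artifacts/videos/frames` directory are assumptions, while the ffmpeg options used (`-framerate`, `-i`, `-c:v libx264`, `-pix_fmt yuv420p`) are standard ones for encoding an image sequence.

```python
import shlex


def build_ffmpeg_command(frames_dir, fps, output_path, pattern="frame_%06d.png"):
    """Build an ffmpeg argv that encodes a numbered PNG sequence to an H.264 mp4."""
    return [
        "ffmpeg", "-y",                   # overwrite the output file if it exists
        "-framerate", str(fps),           # input frame rate of the PNG sequence
        "-i", f"{frames_dir}/{pattern}",  # numbered input frames (assumed pattern)
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",            # widely compatible pixel format
        output_path,
    ]


cmd = build_ffmpeg_command("artifacts/videos/frames", 30, "artifacts/videos/demo.mp4")
print(shlex.join(cmd))
```

Building the command as a list makes it safe to pass to `subprocess.run` without shell quoting issues.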

## 5.1 Model trend visualization (HTML / Markdown)

Visualizes how key metrics evolve over training for the models in `artifacts/models/`, and writes a Chinese-language HTML or Markdown report.

Default command:

```bash
python -m src.plot_model_max_x_trend
```

Default output:

- `artifacts/reports/model_max_x_trend.html`

To output a Markdown report:

```bash
python -m src.plot_model_max_x_trend --format markdown
```

Default Markdown output:

- `artifacts/reports/model_max_x_trend.md`

Optional parameters (custom directories and output path):

```bash
uv run python -m src.plot_model_max_x_trend \
  --models-dir artifacts/models \
  --logs-dir artifacts/logs \
  --format markdown \
  --output artifacts/reports/model_max_x_trend.md
```

Report contents:

- Main trend: `max_x` (maximum forward distance)
- Multi-dimensional trends: mean return, mean episode length, level-clear rate, no-progress termination rate, death termination rate, timeout termination rate, hard-stuck termination rate
- Model detail table: the metric values, matched step, and source TensorBoard tag for each checkpoint/final model
- Glossary: Run, Checkpoint, model_step, matched_step, TensorBoard Tag, and other specialized terms
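
The report distinguishes `model_step` from `matched_step`, which implies that each saved model is paired with a logged TensorBoard step. How the script performs that pairing is not shown here; the sketch below assumes nearest-step matching, and `match_step` is a hypothetical name.

```python
import bisect


def match_step(model_step, logged_steps):
    """Return the logged step closest to model_step (ties go to the earlier step)."""
    if not logged_steps:
        raise ValueError("no logged steps to match against")
    steps = sorted(logged_steps)
    i = bisect.bisect_left(steps, model_step)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = steps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: (abs(s - model_step), s))
```

For example, a checkpoint saved at step 100000 would be paired with whichever logged step lies closest to 100000.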

## 6. Choosing the action space

The default is `RIGHT_ONLY`, for these reasons:
@@ -231,22 +392,23 @@ python -m src.train_ppo --reward-mode clip

```bash
python -m src.train_ppo \
  --init-model-path artifacts/models/latest_model.zip \
  --init-model-path /home/roog/super-mario/mario-rl-mvp/artifacts/models/ppo_SuperMarioBros-1-1-v0_20260212_205220/ppo_mario_ckpt_100000_steps.zip \
  --n-envs 16 \
  --allow-partial-init \
  --reward-mode progress \
  --movement simple \
  --ent-coef 0.04 \
  --ent-coef 0.01 \
  --learning-rate 1e-4 \
  --n-steps 512 \
  --gamma 0.995 \
  --death-penalty -120 \
  --stall-penalty 0.2 \
  --stall-steps 20 \
  --backward-penalty-scale 0.03 \
  --n-steps 1024 \
  --gamma 0.99 \
  --death-penalty -50 \
  --stall-penalty 0.05 \
  --stall-steps 40 \
  --backward-penalty-scale 0.01 \
  --milestone-bonus 2.0 \
  --no-progress-terminate-steps 80 \
  --no-progress-terminate-penalty 30 \
  --total-timesteps 150000
  --no-progress-terminate-steps 300 \
  --no-progress-terminate-penalty 10 \
  --total-timesteps 300000
```

## 8. Troubleshooting common issues
@@ -312,6 +474,30 @@ python -m src.train_ppo --init-model-path artifacts/models/latest_model.zip --mo

3) Or simply skip loading the old model and train the new action space from scratch.

### 8.6 `cudaGetDeviceCount ... Error 304` (CUDA initialization failure under WSL)

If you see the following when training starts:

```text
[device] cpu | CUDA unavailable, using CPU.
[device_diag] ... torch.cuda.is_available()=False ... Error 304 ...
```

the problem is not a missing `device` argument; the CUDA runtime failed to initialize in the current environment.

Start with two checks:

```bash
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv,noheader
.venv/bin/python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"
```

The usual cause is an abnormal WSL GPU stack or driver state, not the PPO code itself. If you only need to get an experiment running for now, select the CPU explicitly:

```bash
python -m src.train_ppo --device cpu
```

## 9. Minimal smoke test (run in order)

```bash