Experiment: PPO Smoke Run vs IQN Baseline (A01_as20_long_v2)
=============================================================

Experiment Overview
-------------------

This note compares the **PPO** run **ppo_smoke_run** (on-policy actor-critic,
``save/ppo_smoke_run/config_snapshot.yaml``) to the **IQN** baseline
**A01_as20_long_v2** (see :doc:`global_schedule_speed`).

**TensorBoard layout:** use a **single** logical run name for the baseline,
``A01_as20_long_v2``. The analysis script merges the suffix chunks automatically:

- ``tensorboard/A01_as20_long_v2``
- ``tensorboard/A01_as20_long_v2_2``
- ``tensorboard/A01_as20_long_v2_3``

PPO logs live in ``tensorboard/ppo_smoke_run``.

**What differs vs the IQN baseline:** algorithm (**PPO** vs IQN), on-policy
rollouts vs a replay buffer, network stack, and **performance** settings
(e.g. PPO ``running_speed: 32`` vs typical long-run IQN settings). This is
**not** a controlled single-variable experiment.

**Goal:** compare learning curves **by cumulative training hours** and **by
environment steps** over the **common window** where both runs overlap.

Results
-------

**Important:** tables and plots use ``scripts/analyze_experiment_by_relative_time.py``
with ``--time-axis auto`` (**cumulative training hours**). **ppo_smoke_run**
lasted **~6.61 h** of cumulative training; **A01_as20_long_v2** (merged TB)
**~17.74 h**. All **by-time** comparisons below use checkpoints
**0.5, 1.0, …, 6.5 h** (the common window ends when the shorter run stops).
**BY STEP** comparisons use checkpoints **1M … 21M** (common window = the
minimum of the maximum steps across runs).

**Key findings:**

- **alltime_min_ms_A01** (scalar best-so-far): at **6.5 h** cumulative
  training, **ppo_smoke_run** **24.830 s** vs **A01_as20_long_v2**
  **24.300 s** (IQN ahead by **~0.53 s**). At **21M** steps: **24.830 s** vs
  **24.450 s** (IQN still ahead).
- **Per-race eval** (``Race/eval_race_time_trained_A01``): at **6.5 h**, best
  **24.930 s** (PPO) vs **24.300 s** (IQN); eval **finish rate** (IQN only,
  from per-race finished events) **~75%** at 6.5 h vs **36%** at 0.5 h. PPO's
  legacy log has **no** ``eval_race_finished_*`` events, so the PPO rate
  columns stay empty.
- **Per-race explo:** at **6.5 h**, best **24.850 s** (PPO) vs **24.410 s** (IQN).
- **Robust eval** (``Race/eval_race_time_robust_trained_A01``): logged for
  **IQN only**; at **6.5 h** best **24.300 s** (mean **25.41 s** over finished
  robust races in that window).
- **Training/loss** (IQN): **~337.8** at **6.5 h** (not comparable to the PPO
  objective). **RL/avg_Q_trained_A01** at **6.5 h**: **~0.0017**.
  **Performance/learner_percentage_training** at **6.5 h**: **~64.8%**.
- **Documented final** IQN best for this series remains **24.150 s** from saved
  state over the **full** long run (:doc:`global_schedule_speed`).
  **ppo_smoke_run** ``accumulated_stats.joblib`` best: **24.830 s**
  (**21,627,235** frames).
- **Legacy PPO loss logging:** ``Training/ppo_loss`` in the smoke TensorBoard
  used the **update index** as its step, so **BY STEP** loss and loss-vs-hours
  alignment are **not** trustworthy until a re-run with the fixed learner
  (frame-aligned ``global_step``). A misleading dual-run **ppo_loss** JPG is
  deliberately **not** committed.

Run Analysis
------------

- **ppo_smoke_run:** PPO; ``rollout_steps_per_update: 2048``;
  ``running_speed: 32``; ``gpu_collectors_count: 8``. **~6.61 h** cumulative
  training (per the script). TensorBoard: ``tensorboard/ppo_smoke_run``.
- **A01_as20_long_v2:** IQN baseline; TensorBoard merged from **three**
  folders (see Overview). **~17.74 h** cumulative training (per the script).
  Save / final best: :doc:`global_schedule_speed`.
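The eval finish rate cited in the findings is derived from per-race finished
events. A minimal sketch of that computation, assuming each event is logged as
a 0/1 "finished" flag (the ``finish_rate`` helper is illustrative, not a
function from the analysis script):

```python
def finish_rate(finished_flags):
    """Fraction of eval races that finished.

    Returns None when no events exist at all (e.g. the legacy PPO log,
    which has no eval_race_finished_* events), so the rate column stays
    empty instead of misleadingly showing 0%.
    """
    if not finished_flags:
        return None
    return sum(finished_flags) / len(finished_flags)


# 3 of 4 eval races finished in this window -> 0.75
finish_rate([1, 0, 1, 1])
```

This distinction between "no events" (``None``) and "all races crashed"
(``0.0``) matches the behaviour described above: PPO's rate columns are empty,
not zero.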
**Reproduce (tables + optional plots):**

::

    python scripts/analyze_experiment_by_relative_time.py --logdir tensorboard ppo_smoke_run A01_as20_long_v2 --interval-training-hours 0.5 --step_interval 1000000

::

    python scripts/analyze_experiment_by_relative_time.py --logdir tensorboard ppo_smoke_run A01_as20_long_v2 --interval-training-hours 0.5 --step_interval 1000000 --plot --output-dir docs/source/_static --prefix exp_ppo_smoke_vs_A01_v2

Detailed TensorBoard Metrics Analysis
-------------------------------------

**Methodology:** same as the ``analyze_experiment_by_relative_time.py`` output:
per-race stats and scalars at shared cumulative-training-hour and step
checkpoints. The figures below show **one metric per graph** (both runs as
lines, **cumulative training hours** on the X axis).

A01 eval — best time (``Race/eval_race_time_trained_A01``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** PPO best **24.930 s**; IQN best **24.300 s**.
- **21M steps:** PPO best **24.930 s**; IQN best **24.450 s**.

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_A01_best.jpg
   :alt: A01 eval best race time by cumulative training hours (ppo_smoke_run vs A01_as20_long_v2)

A01 eval — mean time (all episodes)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** PPO mean **296.67 s**; IQN mean **94.16 s** (IQN finishes a
  larger fraction of eval races).

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_A01_mean.jpg
   :alt: A01 eval mean race time by cumulative training hours (ppo_smoke_run vs A01_as20_long_v2)

A01 eval — finish rate (IQN only in this comparison)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** IQN **75%** finish rate (the PPO column is empty in the script
  output: no legacy ``eval_race_finished_*`` events in TB).

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_A01_rate.jpg
   :alt: A01 eval finish rate by cumulative training hours (A01_as20_long_v2; PPO N/A)

Robust eval (IQN only): ``Race/eval_race_time_robust_trained_A01``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- PPO does not log this tag; the figures show the **IQN** series only.

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_Race_eval_race_time_robust_trained_A01_best.jpg
   :alt: A01 robust eval best time by cumulative training hours (A01_as20_long_v2)

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_Race_eval_race_time_robust_trained_A01_mean.jpg
   :alt: A01 robust eval mean time by cumulative training hours (A01_as20_long_v2)

Best-so-far scalar (``alltime_min_ms_A01``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** **24.830 s** (PPO) vs **24.300 s** (IQN).

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_a01_best_time_ms.jpg
   :alt: alltime_min_ms A01 by cumulative training hours (ppo_smoke_run vs A01_as20_long_v2)

Training loss (IQN: ``Training/loss``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** IQN **~337.82** (last value at the checkpoint). PPO has no IQN
  loss.

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_loss.jpg
   :alt: Training loss by cumulative training hours (A01_as20_long_v2; IQN TD loss)

Average Q (IQN: ``RL/avg_Q_trained_A01``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** **~0.0017**.

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_avg_q.jpg
   :alt: Avg Q trained A01 by cumulative training hours (A01_as20_long_v2)

Learner training percentage (IQN)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **6.5 h:** **~64.8%**.

.. image:: ../_static/exp_ppo_smoke_vs_A01_v2_training_pct.jpg
   :alt: Learner percentage training by cumulative training hours (A01_as20_long_v2)

Configuration Changes
---------------------

**ppo_smoke_run** (``save/ppo_smoke_run/config_snapshot.yaml``):
``training.algorithm: ppo``, ``run_name: ppo_smoke_run``,
``global_schedule_speed: 1``, PPO block as in the snapshot;
``performance.running_speed: 32``, ``gpu_collectors_count: 8``.

**A01_as20_long_v2** (IQN): :doc:`global_schedule_speed`
(e.g. ``global_schedule_speed: 4``, ``batch_size: 4096``).

Hardware
--------

- **PPO run:** 8 GPU collectors per the config; GPU model and host are local
  to the machine that produced the logs.

Conclusions
-----------

- With the **merged** TensorBoard (``A01_as20_long_v2`` + ``_2`` + ``_3``),
  IQN is **ahead** of this PPO smoke run on **alltime_min**, **eval best**,
  and **explo best** within the **shared 6.5 h** and **21M-step** windows,
  and shows a **high eval finish rate**, while PPO's eval mean time remains
  dominated by non-finishes.
- **Robust eval** and **Q / IQN loss** metrics are **IQN-specific**; PPO
  comparisons should rely on race times and ``alltime_min_ms_*`` (and on
  frame-aligned PPO scalars after the learner fix).

Recommendations
---------------

- Always pass the **base** run name ``A01_as20_long_v2`` to the script so the
  ``_2`` / ``_3`` chunks merge; do not analyze only the first folder.
- For PPO vs IQN at **equal cumulative training**, use
  ``--interval-training-hours`` and the printed hour checkpoints; for
  **equal env steps**, use the **BY STEP** section.
- Re-run PPO with frame-aligned ``Training/ppo_loss`` logging before comparing
  policy-gradient loss to IQN loss.
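The chunk-merging behaviour relied on above (pass the base run name, get the
``_2`` / ``_3`` folders too) can be sketched as follows. This is an
illustrative reimplementation, not the analysis script's actual code; the
``chunk_dirs`` helper name is assumed:

```python
import re
from pathlib import Path


def chunk_dirs(logdir, run_name):
    """Return the base run folder plus its numeric suffix chunks
    (run_name, run_name_2, run_name_3, ...) in chunk order."""
    pattern = re.compile(re.escape(run_name) + r"(_(\d+))?")
    matches = []
    for path in Path(logdir).iterdir():
        m = pattern.fullmatch(path.name)
        if path.is_dir() and m:
            # The base folder sorts first (key 1); suffixed chunks
            # follow in numeric order (_2 -> 2, _3 -> 3, ...).
            key = int(m.group(2)) if m.group(2) else 1
            matches.append((key, path))
    return [path for _, path in sorted(matches)]
```

Analyzing only the first folder would silently drop the later training hours,
which is why the recommendation insists on the base run name.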
**Analysis tools**

- ``python scripts/analyze_experiment_by_relative_time.py ppo_smoke_run A01_as20_long_v2 --logdir tensorboard --interval-training-hours 0.5 --step_interval 1000000``
- Plots: ``--plot --output-dir docs/source/_static --prefix exp_ppo_smoke_vs_A01_v2`` (skip regenerating ``*_ppo_loss.jpg`` until the PPO TB steps are fixed).
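The checkpoint grids used throughout (0.5 … 6.5 h by time, 1M … 21M by step)
both follow the same common-window rule: multiples of the interval up to the
shorter run's maximum. A sketch of that rule; the ``common_checkpoints``
helper is illustrative, and the 60M IQN step total below is a placeholder,
not a measured value (PPO's 21,627,235 frames are from the run stats):

```python
def common_checkpoints(run_maxima, interval):
    """Checkpoint grid over the window shared by all runs: multiples of
    `interval` up to the smallest run's maximum (hours or env steps)."""
    end = min(run_maxima)  # the shorter run bounds the comparison window
    count = int(end // interval)
    return [(i + 1) * interval for i in range(count)]


# By time: 6.61 h (PPO) vs 17.74 h (IQN) -> 0.5, 1.0, ..., 6.5
hours = common_checkpoints([6.61, 17.74], 0.5)

# By step: PPO frame count vs a placeholder IQN total -> 1M, 2M, ..., 21M
steps = common_checkpoints([21_627_235, 60_000_000], 1_000_000)
```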