Experiment: Multi-action Offset Training (A01_as20_long v3 series)
===================================================================

Experiment Overview
-------------------

This experiment evaluates a new RL training mode where the agent learns with **multi-action time offsets**.

In multi-action mode (``rl_action_offsets_ms`` has more than one value), the policy makes a single forward pass and predicts ``N`` actions for offsets ``0, 10, 20, ...`` ms. The rollout then applies these actions on a 10 ms step period, and a replay transition corresponds to one **decision block** (N actions + aggregated reward over N steps). Exploration can be configured as either:

- ``multi_action_exploration: per_action``: epsilon is sampled independently per action inside the block.
- ``multi_action_exploration: per_block``: one epsilon draw applies to the whole block (either all greedy or all random).

Because a decision is made per block (N actions spanning multiple 10 ms steps), multi-action lookahead is applied at a lower decision frequency than single-action training, and in ``per_block`` mode the fully-random blocks become increasingly rare as epsilon decays.
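
The difference between the two exploration modes can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the ``n_discrete_actions`` parameter, and the assumption that a "random" action is drawn uniformly from a discrete action space are all hypothetical.

.. code-block:: python

   import random


   def choose_block_actions(greedy_actions, n_discrete_actions, epsilon, mode, rng):
       """Apply epsilon-greedy exploration to one decision block.

       greedy_actions: the N greedy action indices predicted for the block.
       n_discrete_actions: size of the action space (hypothetical parameter).
       mode: "per_action" or "per_block".
       """
       def random_action():
           # Assumption: exploratory actions are uniform over the action space.
           return rng.randrange(n_discrete_actions)

       if mode == "per_block":
           # One epsilon draw decides the whole block: all random or all greedy.
           if rng.random() < epsilon:
               return [random_action() for _ in greedy_actions]
           return list(greedy_actions)
       if mode == "per_action":
           # Independent epsilon draw for every action inside the block.
           return [random_action() if rng.random() < epsilon else a
                   for a in greedy_actions]
       raise ValueError(f"unknown mode: {mode!r}")


   def p_fully_random_block(epsilon, n_actions_per_block, mode):
       """Probability that every action in a block comes from a random draw."""
       if mode == "per_block":
           return epsilon
       return epsilon ** n_actions_per_block


   rng = random.Random(0)
   greedy = [3, 3, 3, 3, 3]
   print(choose_block_actions(greedy, 12, 0.1, "per_action", rng))
   print(choose_block_actions(greedy, 12, 0.1, "per_block", rng))
   for mode in ("per_action", "per_block"):
       print(mode, p_fully_random_block(0.1, 5, mode))

Under these assumptions, with ``epsilon = 0.1`` and ``N = 5`` a fully random block occurs with probability ``0.1`` in ``per_block`` mode but only ``0.1 ** 5`` in ``per_action`` mode, while a fully greedy block occurs with probability ``0.9`` vs about ``0.59``.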

Runs compared on map ``A01``:

- ``A01_as20_long_v3``: multi-action enabled, ``multi_action_exploration`` default (per_action), ``global_schedule_speed = 1``, no BC head pretrain.
- ``A01_as20_long_v3.1``: same multi-action setup, ``multi_action_exploration = per_block`` and faster schedules (``global_schedule_speed = 4``).
- ``A01_as20_long_v3.1_pretrained_bc``: same as v3.1 but initializes RL from BC heads with ``pretrain_bc_heads_path: output/ptretrain/bc/v5_multi_offset``.

Why ``global_schedule_speed = 4``: the value follows the earlier ablation in ``docs/source/experiments/global_schedule_speed.rst`` (A01 long v2 series). The best saved A01 time from that series is ``alltime_min_ms['A01'] = 24150`` (i.e. ~``24.15s``) in ``save/A01_as20_long_v2``; in TensorBoard it shows up in the suffixed continuation run ``tensorboard/A01_as20_long_v2_3``.

For the “longest run” comparison (over 100M training steps), the reference is ``A01_as20_long`` (single-map A01, trained with ``tensorboard_suffix_schedule`` up to ~150M steps).

Results
-------

Important: run lengths differ. **Primary quantitative comparisons here use training steps** (BY STEP tables from the analysis script). Any **time-axis** prose or regenerated tables must use **cumulative training hours** (``--time-axis auto`` or ``cumul_training_hours``), not raw TensorBoard wall minutes across merged logs — see :doc:`index`.

Key findings

- Multi-action schedule speedup (v3 → v3.1): ``A01_as20_long_v3.1`` reaches competitive ``alltime_min_ms_A01`` values much earlier **in environment steps** than ``A01_as20_long_v3`` (see the BY STEP output of the analysis command in `Analysis Tools`_).
- BC head pretraining (v3.1 → v3.1_pretrained_bc) improves best time and finish rate at matched steps:

  - At 20M steps: eval best time ``24.570s`` and finish rate ``59%`` for ``v3.1_pretrained_bc`` vs ``24.850s`` and ``45%`` for ``v3.1``.
  - At 80M steps: eval best time ``24.260s`` and finish rate ``73%`` for ``v3.1_pretrained_bc`` vs ``24.410s`` and ``67%`` for ``v3.1``.

- Comparison with the longest run ``A01_as20_long`` (step overlap is limited; the common window extends only to ~19.2M steps): at that shared step, ``v3.1_pretrained_bc`` has a slightly slower best time (``24.570s`` vs ``24.510s``; lower is better) and a lower finish rate (``59%`` vs ``71%``). The full ``v3.1_pretrained_bc`` run continues well beyond that overlap; its final logged best, **A01 24.26s**, comes from steps beyond the shared window.
- Direct check: ``A01_as20_long_v3.1`` vs ``A01_as20_long_v2`` (TB merged across suffix dirs). **By steps** (1M checkpoints), ``v2`` stays ahead on A01 eval best time (e.g. 20M: ``24.460s`` vs ``24.850s``; 40M: ``24.300s`` vs ``24.470s``; 80M: ``24.200s`` vs ``24.410s``). Final saved bests from ``save/<run>/accumulated_stats.joblib``: ``v2 = 24.150s`` (``24150`` ms), ``v3.1 = 24.410s`` (``24410`` ms).
- **Main baseline vs multi-offset + BC heads:** ``A01_as20_long_v2`` (single-action) vs ``A01_as20_long_v3.1_pretrained_bc`` (multi-action + ``v5_multi_offset`` BC init); see `Direct comparison: v2 vs v3.1_pretrained_bc`_ for tables, plots, and save-state check. In short: ``v2`` keeps better **best** and **mean** eval times and higher eval finish rate in the common window; saved ``alltime_min_ms['A01']`` is ``24150`` ms vs ``24260`` ms.

Run Analysis
------------

- ``A01_as20_long_v3`` / ``v3.1`` / ``v3.1_pretrained_bc``: merged TensorBoard dirs. **Cumulative training** for curves = ``cumul_training_hours`` (TB / ``save/.../accumulated_stats.joblib``). Example: ``save/A01_as20_long_v3.1_pretrained_bc/`` ≈ **24.5 h** training, A01 best **24260 ms**.
- ``A01_as20_long`` (longest reference): same rules; see :doc:`time_axis_conventions` audit table.
- ``A01_as20_long_v2``: merged **3** TB dirs. **Audited (local logs):** **~17.7 h** cumulative training vs **~2898 min** TB wall span (ratio **~2.7×**) — wall minutes **must not** be read as training duration. Pair with ``v3.1_pretrained_bc``: **~24.5 h** training vs **~3222 min** wall (**~2.2×**).
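
The wall-to-training ratios quoted above can be re-derived with a few lines of arithmetic. The sketch below only reuses the audited figures from this section; the function and variable names are illustrative and not part of the repository scripts.

.. code-block:: python

   def wall_to_training_ratio(wall_minutes, training_hours):
       """Ratio of TensorBoard wall span to cumulative training time."""
       return (wall_minutes / 60.0) / training_hours


   # (TB wall span in minutes, cumulative training hours), as audited above.
   AUDITED = {
       "A01_as20_long_v2": (2898, 17.7),
       "A01_as20_long_v3.1_pretrained_bc": (3222, 24.5),
   }

   for run, (wall_min, train_h) in AUDITED.items():
       ratio = wall_to_training_ratio(wall_min, train_h)
       print(f"{run}: {wall_min / 60:.1f} h wall / {train_h} h training = {ratio:.1f}x")

Because the ratio differs between the two runs (about 2.7 vs 2.2), plotting both on a wall-minute axis would distort the comparison as well as overstate training time.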

Direct comparison: v2 vs v3.1_pretrained_bc
-------------------------------------------

This section compares **single-action** ``A01_as20_long_v2`` to **multi-action + BC-pretrained** ``A01_as20_long_v3.1_pretrained_bc``. Merged suffix logs (**3** vs **4** dirs). **Cumulative training** totals: **~17.74 h** (v2) vs **~24.51 h** (v3.1_bc); **common BY TIME window** ends at **~17.74 h** (v2 stops earlier in training time). TB **wall** spans are **~2900 min** vs **~3220 min** — **not** the same as those training hours (see :doc:`time_axis_conventions`).

**Save-state cross-check:** ``save/.../accumulated_stats.joblib`` bests **24150 ms** vs **24260 ms** — matches **100M-step** BY STEP eval-best rows and the last **cumul_training_hours** checkpoints below.

**Recomputed BY TIME table** (``Race/eval_race_time_trained_A01``, ``--time-axis cumul_training_hours``, ``--interval-training-hours 1``, merged TB, 2026-03):

- **2 h:** best **24.55s** vs **25.58s**; mean **129.4s** vs **184.7s**; eval finish rate **62%** vs **43%**.
- **8 h:** best **24.29s** vs **24.40s**; mean **91.0s** vs **115.7s**; rate **76%** vs **68%**.
- **~17.7 h** (end of common window): best **24.15s** vs **24.26s**; mean **80.7s** vs **98.6s**; rate **80%** vs **74%**.
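
To make the trend in these checkpoints easier to read, the gaps between the two runs can be computed directly. This is a small illustrative sketch that hard-codes the three rows above (v2 first, v3.1_pretrained_bc second); the names are not taken from the analysis script.

.. code-block:: python

   # (label, v2 best s, v3.1_bc best s, v2 finish %, v3.1_bc finish %)
   BY_TIME_ROWS = [
       ("2 h", 24.55, 25.58, 62, 43),
       ("8 h", 24.29, 24.40, 76, 68),
       ("~17.7 h", 24.15, 24.26, 80, 74),
   ]


   def gaps(rows):
       """Return (label, best-time gap in s, finish-rate gap in points)."""
       out = []
       for label, v2_best, bc_best, v2_rate, bc_rate in rows:
           out.append((label, round(bc_best - v2_best, 2), v2_rate - bc_rate))
       return out


   for label, best_gap_s, rate_gap_pts in gaps(BY_TIME_ROWS):
       print(f"{label}: v3.1_bc best is {best_gap_s:.2f} s slower, "
             f"finish rate {rate_gap_pts} points lower")

The best-time gap shrinks from about 1.03 s at 2 h to 0.11 s at 8 h and stays at 0.11 s at the end of the common window, while the finish-rate gap narrows from 19 to 8 to 6 points.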

**By-step** comparisons use a common window up to **100M** steps. Embedded figures use **cumulative training hours** on X (``generate_experiment_plots.py`` default **auto**).

Detailed TensorBoard Metrics Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Methodology — By time vs by steps:** Regenerate the BY TIME grid with explicit training-hours axis::

   python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc --time-axis cumul_training_hours --interval-training-hours 1 --step_interval 5000000

**BY STEP** uses the same command (step grid unchanged). Do **not** use wall-minute checkpoints for this pair: wall span **>2×** active training time (:doc:`time_axis_conventions`).

**Figures:** ``python scripts/generate_experiment_plots.py --experiments multi_offset_v2_vs_v31bc_pretrained``

A01 eval — best time (``Race/eval_race_time_trained_A01``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**By cumulative training hours:** see the **2 h / 8 h / ~17.7 h** rows in the section above (recomputed from merged TensorBoard).

**By steps:** at 20M — **24.46s** vs **24.57s**; at 100M — **24.15s** vs **24.26s**.

.. image:: ../_static/exp_multi_offset_v2_vs_v31bc_pretrained_A01_best.jpg
   :alt: A01 eval best race time by cumulative training hours (A01_as20_long_v2 vs A01_as20_long_v3.1_pretrained_bc)

A01 eval — mean time (all episodes; includes DNF / cutoff races)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Mean time is dominated by non-finished runs; use it as a **stability / typical-episode** signal alongside best time.
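
A toy calculation shows why the mean tracks finish rate far more than pace. The sketch assumes that non-finished episodes are logged at a fixed cutoff time; the finisher mean of 25 s and the cutoff of 300 s are hypothetical values chosen for illustration, not numbers read from the logs.

.. code-block:: python

   def mean_eval_time(finish_rate, finisher_mean_s, cutoff_s):
       """Mean over all episodes when DNF episodes count as cutoff_s."""
       return finish_rate * finisher_mean_s + (1.0 - finish_rate) * cutoff_s


   FINISHER_MEAN_S = 25.0   # hypothetical mean time of finished episodes
   CUTOFF_S = 300.0         # hypothetical time recorded for a DNF episode

   for rate in (0.74, 0.80):
       mean_s = mean_eval_time(rate, FINISHER_MEAN_S, CUTOFF_S)
       print(f"finish rate {rate:.0%}: mean {mean_s:.1f} s")

   # Effect of a 0.5 s faster finisher pace at a fixed 80% finish rate.
   faster = mean_eval_time(0.80, FINISHER_MEAN_S - 0.5, CUTOFF_S)
   print(f"0.5 s faster finishers at 80%: mean {faster:.1f} s")

With these illustrative values, six points of finish rate move the mean by more than 16 s, whereas half a second of finisher pace moves it by only 0.4 s.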

**By cumulative training hours:** mean-time checkpoints are in the same BY TIME table output as for best time (see command above).

**By steps:** at 100M — **80.69s** vs **98.50s** mean eval time; eval finish rate **80%** vs **74%**.

.. image:: ../_static/exp_multi_offset_v2_vs_v31bc_pretrained_A01_mean.jpg
   :alt: A01 eval mean race time by cumulative training hours (A01_as20_long_v2 vs A01_as20_long_v3.1_pretrained_bc)

Configuration differences (this pair)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**``A01_as20_long_v2``** (``save/A01_as20_long_v2/config_snapshot.yaml``): single-action setup (no ``rl_action_offsets_ms`` list in snapshot; ``n_prev_actions_in_inputs: 5``); ``training.global_schedule_speed: 4``; ``training.pretrain_bc_heads_path: null``.

**``A01_as20_long_v3.1_pretrained_bc``** (``save/A01_as20_long_v3.1_pretrained_bc/config_snapshot.yaml``): ``environment.rl_action_offsets_ms: [0, 10, 20, 30, 40]``; ``n_prev_actions_in_inputs: 25``; ``exploration.multi_action_exploration: per_block``; ``training.global_schedule_speed: 4``; ``training.pretrain_bc_heads_path: output/ptretrain/bc/v5_multi_offset``.

Configuration Changes
---------------------

The runs share the same multi-action offsets:

- ``environment.rl_action_offsets_ms = [0, 10, 20, 30, 40]`` (N=5 actions per decision block; applied on 10 ms rollout cadence).
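
As a rough illustration of what one decision block looks like for these offsets, the sketch below builds a single replay transition. It is not the project's replay code: the ``BlockTransition`` type and ``make_block_transition`` helper are hypothetical, and summing the per-step rewards is an assumption about how the reward is aggregated.

.. code-block:: python

   from dataclasses import dataclass
   from typing import List, Sequence

   STEP_PERIOD_MS = 10  # rollout cadence stated above


   @dataclass
   class BlockTransition:
       actions: List[int]   # one action per offset in the block
       reward: float        # reward aggregated over the block's steps
       duration_ms: int     # time covered by the block


   def make_block_transition(offsets_ms: Sequence[int],
                             actions: Sequence[int],
                             step_rewards: Sequence[float]) -> BlockTransition:
       """Pack N actions and N per-step rewards into one transition."""
       n = len(offsets_ms)
       if len(actions) != n or len(step_rewards) != n:
           raise ValueError("need one action and one reward per offset")
       # Assumption: the aggregated reward is a plain sum over the N steps.
       return BlockTransition(list(actions), float(sum(step_rewards)),
                              n * STEP_PERIOD_MS)


   block = make_block_transition([0, 10, 20, 30, 40],
                                 [1, 1, 2, 2, 3],
                                 [0.5, 0.5, 1.0, 0.0, -0.5])
   print(block)

With five offsets, each transition covers 50 ms of rollout, so the replay buffer receives one transition per five environment steps.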

Differences:

- ``A01_as20_long_v3``:

  - ``training.global_schedule_speed = 1``
  - ``exploration.multi_action_exploration`` uses default (per_action)
  - ``training.pretrain_bc_heads_path = null``

- ``A01_as20_long_v3.1``:

  - ``training.global_schedule_speed = 4``
  - ``exploration.multi_action_exploration = per_block``
  - ``training.pretrain_bc_heads_path = null``

- ``A01_as20_long_v3.1_pretrained_bc``:

  - ``training.global_schedule_speed = 4``
  - ``exploration.multi_action_exploration = per_block``
  - ``training.pretrain_bc_heads_path = output/ptretrain/bc/v5_multi_offset``
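
The same differences can be expressed as data and compared pairwise, which makes it explicit that v3 → v3.1 changes two settings at once while v3.1 → v3.1_pretrained_bc changes only one. The dictionary below simply restates the list above (with ``per_action`` written out for the v3 default); ``RUN_OVERRIDES`` and ``diff_runs`` are illustrative names, not repository code.

.. code-block:: python

   BC_PATH = "output/ptretrain/bc/v5_multi_offset"

   RUN_OVERRIDES = {
       "A01_as20_long_v3": {
           "training.global_schedule_speed": 1,
           "exploration.multi_action_exploration": "per_action",  # default
           "training.pretrain_bc_heads_path": None,
       },
       "A01_as20_long_v3.1": {
           "training.global_schedule_speed": 4,
           "exploration.multi_action_exploration": "per_block",
           "training.pretrain_bc_heads_path": None,
       },
       "A01_as20_long_v3.1_pretrained_bc": {
           "training.global_schedule_speed": 4,
           "exploration.multi_action_exploration": "per_block",
           "training.pretrain_bc_heads_path": BC_PATH,
       },
   }


   def diff_runs(run_a, run_b, overrides=RUN_OVERRIDES):
       """Return {key: (value_a, value_b)} for settings that differ."""
       a, b = overrides[run_a], overrides[run_b]
       return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


   print(diff_runs("A01_as20_long_v3", "A01_as20_long_v3.1"))
   print(diff_runs("A01_as20_long_v3.1", "A01_as20_long_v3.1_pretrained_bc"))

Since schedule speed and exploration mode change together between v3 and v3.1, the results in this report cannot separate their individual contributions.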

Hardware
--------

- GPU: not extracted here (see individual run logs).
- Parallel instances: ``gpu_collectors_count = 8`` for the v3.1_pretrained_bc run.

Conclusions
-----------

- Multi-action offset training works and is sensitive to schedule speed and exploration granularity:

  - Going from v3 to v3.1 (faster schedule + per_block exploration) improves early learning and reaches the ~24.8-24.5 s range quickly.

- Pretraining the RL heads from the multi-offset BC run (v5_multi_offset) provides a durable benefit over ``v3.1`` without pretraining:

  - Better (lower) best time and higher finish rate at matched step counts (20M and 80M).
  - Best time improvements continue through **tens of millions of steps** (see BY STEP rows above).
  - Likely reason: better temporal/action mapping between pretrain and RL. In BC pretrain, the model predicts 5 offset actions at 10 ms spacing; in multi-action RL, one decision is taken every ~50 ms and outputs a 5-action block. This alignment is much closer than in the older single-action RL setup (one action every ~50 ms), where pretrain-to-RL mapping was weak.

- Against the longest existing baseline (``A01_as20_long``, trained up to ~150M steps), the step overlap ends at ~19.2M steps. At that point the offset+pretrained agent is close but slightly behind on best time (``24.570s`` vs ``24.510s``) and has a lower finish rate (``59%`` vs ``71%``); its final best of **24.26s** is logged beyond the shared window.
- Against **``A01_as20_long_v2``** (single-action, same ``global_schedule_speed = 4``), multi-offset + BC heads (**``v3.1_pretrained_bc``**) still trails on **best** eval time, **mean** eval time, and eval **finish rate** on the shared **cumulative-training-hours** window (figures / BY TIME table) and at **100M** steps; saved bests **24.150s** vs **24.260s** (see `Direct comparison: v2 vs v3.1_pretrained_bc`_).

Recommendations
---------------

- If you adopt multi-action offsets, try ``global_schedule_speed = 4`` and ``multi_action_exploration = per_block`` as the first tuning pair.
- If you can afford it, initialize from a multi-offset BC run (here: ``v5_multi_offset``) to improve long-run finish rate and to lower the best time reached at large step counts.
- For comparisons against other “longest run” baselines, expect limited **step** overlap if logging stops at different ranges; use **training-hours** (or wall, if you must) curves from the analysis script or embedded plots.

Analysis Tools
---------------

- Compare v3 variants (default **auto** time axis + by steps; recommended: ``--step_interval 1000000``):

  ``python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v3 A01_as20_long_v3.1 A01_as20_long_v3.1_pretrained_bc --interval-training-hours 0.5 --step_interval 1000000``

- Compare against the longest baseline ``A01_as20_long``:

  ``python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v3.1_pretrained_bc A01_as20_long --interval-training-hours 0.5 --step_interval 1000000``

- Compare **v2** vs **v3.1_pretrained_bc** (BY TIME + BY STEP): ``python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc --interval-training-hours 0.5 --step_interval 5000000``

- **Plots** for docs: ``python scripts/generate_experiment_plots.py --experiments multi_offset_v2_vs_v31bc_pretrained``. One-off JPG: ``python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc --time-axis cumul_training_hours --interval-training-hours 0.5 --step_interval 5000000 --plot --output-dir docs/source/_static --prefix exp_multi_offset_v2_vs_v31bc_pretrained``

- Audit wall vs training time: ``python scripts/audit_tensorboard_training_timeline.py --runs A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc``