Experiment: Multi-action Offset Training (A01_as20_long v3 series)

Experiment Overview

This experiment evaluates a new RL training mode where the agent learns with multi-action time offsets.

In multi-action mode (rl_action_offsets_ms has more than one value), the policy makes a single forward pass and predicts N actions for offsets 0, 10, 20, ... ms. The rollout then applies these actions on a 10 ms step period, and a replay transition corresponds to one decision block (N actions + aggregated reward over N steps). Exploration can be configured as either:

multi_action_exploration: per_action: epsilon is sampled independently per action inside the block.
multi_action_exploration: per_block: one epsilon draw applies to the whole block (either all greedy or all random).
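
The two exploration modes can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names `sample_exploration_mask` and `prob_fully_random_block` are hypothetical, and only the mode names `per_action` and `per_block` come from the configuration described above.

```python
import random


def sample_exploration_mask(n_actions, epsilon, mode, rng=random):
    """Return one boolean per action in the block; True means 'act randomly'.

    Hypothetical helper: only the mode names mirror multi_action_exploration.
    """
    if mode == "per_action":
        # Independent epsilon draw for every action inside the block.
        return [rng.random() < epsilon for _ in range(n_actions)]
    if mode == "per_block":
        # One epsilon draw decides the whole block (all random or all greedy).
        explore = rng.random() < epsilon
        return [explore] * n_actions
    raise ValueError(f"unknown exploration mode: {mode}")


def prob_fully_random_block(n_actions, epsilon, mode):
    """Probability that every action in a single block is random."""
    return epsilon if mode == "per_block" else epsilon ** n_actions
```

With N = 5 and epsilon = 0.1, a fully random block has probability 0.1 under per_block but 0.1^5 = 1e-5 under per_action. Conversely, a fully greedy block has probability 0.9 under per_block and 0.9^5 ≈ 0.59 under per_action.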

Because a decision is made per block (N actions spanning multiple 10 ms steps), multi-action lookahead is applied at a lower decision frequency than single-action training, and in per_block mode the fully-random blocks become increasingly rare as epsilon decays.
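
A minimal sketch of how one decision block might become a single replay transition. The names `BlockTransition`, `rollout_block`, and the `env_step` callback are hypothetical. Summing the per-step rewards and truncating the block when the episode ends are both assumptions: the description above only states that the reward is aggregated over the N steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class BlockTransition:
    state: object          # observation at the decision point
    actions: List[int]     # actions actually applied in this block
    reward: float          # reward aggregated over the applied steps
    next_state: object     # observation after the last applied step
    done: bool


def rollout_block(
    state,
    actions: Sequence[int],
    env_step: Callable[[object, int], Tuple[object, float, bool]],
) -> BlockTransition:
    """Apply the predicted actions one per rollout step and aggregate reward."""
    total_reward = 0.0
    current = state
    done = False
    applied: List[int] = []
    for action in actions:
        current, reward, done = env_step(current, action)
        total_reward += reward  # assumption: plain sum over the block
        applied.append(action)
        if done:
            break  # assumption: stop applying actions once the episode ends
    return BlockTransition(state, applied, total_reward, current, done)
```

Each call consumes up to N environment steps but produces only one transition, which is why the decision frequency is lower than in single-action training.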

Runs compared on map A01:

A01_as20_long_v3: multi-action enabled, multi_action_exploration default (per_action), global_schedule_speed = 1, no BC head pretrain.
A01_as20_long_v3.1: same multi-action setup, multi_action_exploration = per_block and faster schedules (global_schedule_speed = 4).
A01_as20_long_v3.1_pretrained_bc: same as v3.1 but initializes RL from BC heads with pretrain_bc_heads_path: output/ptretrain/bc/v5_multi_offset.

Notes on why global_schedule_speed = 4: this choice is based on the earlier ablation in docs/source/experiments/global_schedule_speed.rst (A01 long v2 series). The best saved A01 time is alltime_min_ms['A01'] = 24150 (i.e. ~24.15s) in save/A01_as20_long_v2; in TensorBoard it shows up in the suffixed continuation run tensorboard/A01_as20_long_v2_3.

For the “longest run” comparison (100M+ training steps): A01_as20_long (single-map A01, trained with tensorboard_suffix_schedule up to ~150M steps).

Results

Important: run lengths differ. Primary quantitative comparisons here use training steps (BY STEP tables from the analysis script). Any time-axis prose or regenerated tables must use cumulative training hours (--time-axis auto or cumul_training_hours), not raw TensorBoard wall minutes across merged logs — see Experiments.

Key findings

Multi-action schedule speedup (v3 → v3.1): A01_as20_long_v3.1 reaches strong alltime_min_ms_A01 much earlier in environment steps than A01_as20_long_v3 (see BY STEP output from the analysis command in Analysis Tools).
BC head pretraining (v3.1 → v3.1_pretrained_bc) improves best time and finish rate at matched steps:
- At 20M steps: eval best time 24.570s and finish rate 59% for v3.1_pretrained_bc vs 24.850s and 45% for v3.1.
- At 80M steps: eval best time 24.260s and finish rate 73% for v3.1_pretrained_bc vs 24.410s and 67% for v3.1.
Comparison with the longest run A01_as20_long (step overlap is limited: the common window extends only to ~19.2M steps): at that shared step, v3.1_pretrained_bc has a slightly slower best time (24.570s vs 24.510s) and a lower finish rate (59% vs 71%). The full v3.1_pretrained_bc run continues well beyond that overlap; its final logged best A01 time of 24.26s comes from the longer run.
Direct check: A01_as20_long_v3.1 vs A01_as20_long_v2 (TB merged across suffix dirs). By steps (1M checkpoints), v2 stays ahead on A01 eval best time (e.g. 20M: 24.460s vs 24.850s; 40M: 24.300s vs 24.470s; 80M: 24.200s vs 24.410s). Final saved bests from save/<run>/accumulated_stats.joblib: v2 = 24.150s (24150 ms), v3.1 = 24.410s (24410 ms).
Main baseline vs multi-offset + BC heads: A01_as20_long_v2 (single-action) vs A01_as20_long_v3.1_pretrained_bc (multi-action + v5_multi_offset BC init); see Direct comparison: v2 vs v3.1_pretrained_bc for tables, plots, and save-state check. In short: v2 keeps better best and mean eval times and higher eval finish rate in the common window; saved alltime_min_ms['A01'] is 24150 ms vs 24260 ms.

Run Analysis

A01_as20_long_v3 / v3.1 / v3.1_pretrained_bc: merged TensorBoard dirs. Cumulative training for curves = cumul_training_hours (TB / save/.../accumulated_stats.joblib). Example: save/A01_as20_long_v3.1_pretrained_bc/ ≈ 24.5 h training, A01 best 24260 ms.
A01_as20_long (longest reference): same rules; see Time axis conventions (experiment write-ups) audit table.
A01_as20_long_v2: merged 3 TB dirs. Audited (local logs): ~17.7 h cumulative training vs ~2898 min TB wall span (ratio ~2.7×) — wall minutes must not be read as training duration. Pair with v3.1_pretrained_bc: ~24.5 h training vs ~3222 min wall (~2.2×).
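
The wall-span ratios quoted above can be rechecked with a few lines of arithmetic. This is only a sketch: the helper name `wall_to_training_ratio` is hypothetical, and the inputs are the audited figures listed in this section.

```python
def wall_to_training_ratio(wall_minutes: float, training_hours: float) -> float:
    """Ratio of TensorBoard wall span to cumulative training time."""
    return (wall_minutes / 60.0) / training_hours


# (wall span in minutes, cumulative training hours) as audited above
audited = {
    "A01_as20_long_v2": (2898, 17.7),
    "A01_as20_long_v3.1_pretrained_bc": (3222, 24.5),
}

for run, (wall_min, train_h) in audited.items():
    ratio = wall_to_training_ratio(wall_min, train_h)
    print(f"{run}: {wall_min / 60:.1f} h wall vs {train_h} h training -> {ratio:.1f}x")
```

The wall spans correspond to roughly 48.3 h and 53.7 h, giving ratios of about 2.7x and 2.2x.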

Direct comparison: v2 vs v3.1_pretrained_bc

This section compares single-action A01_as20_long_v2 to multi-action + BC-pretrained A01_as20_long_v3.1_pretrained_bc. Merged suffix logs (3 vs 4 dirs). Cumulative training totals: ~17.74 h (v2) vs ~24.51 h (v3.1_bc); common BY TIME window ends at ~17.74 h (v2 stops earlier in training time). TB wall spans are ~2900 min vs ~3220 min — not the same as those training hours (see Time axis conventions (experiment write-ups)).

Save-state cross-check: save/.../accumulated_stats.joblib bests 24150 ms vs 24260 ms — matches 100M-step BY STEP eval-best rows and the last cumul_training_hours checkpoints below.

Recomputed BY TIME table (Race/eval_race_time_trained_A01, --time-axis cumul_training_hours, --interval-training-hours 1, merged TB, 2026-03):

2 h: best 24.55s vs 25.58s; mean 129.4s vs 184.7s; eval finish rate 62% vs 43%.
8 h: best 24.29s vs 24.40s; mean 91.0s vs 115.7s; rate 76% vs 68%.
~17.7 h (end of common window): best 24.15s vs 24.26s; mean 80.7s vs 98.6s; rate 80% vs 74%.
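
To make the trend in these rows easier to read, the gaps between the two runs can be computed directly. The sketch below only re-uses the numbers from the three checkpoints above (first value v2, second value v3.1_pretrained_bc); the dictionary layout and the function name `gaps` are illustrative choices, not output of the analysis script.

```python
# checkpoint -> ((best_s v2, best_s v3.1_bc), (finish % v2, finish % v3.1_bc))
checkpoints = {
    "2 h": ((24.55, 25.58), (62, 43)),
    "8 h": ((24.29, 24.40), (76, 68)),
    "~17.7 h": ((24.15, 24.26), (80, 74)),
}


def gaps(rows):
    """Return best-time gap (s) and finish-rate gap (points) per checkpoint."""
    result = {}
    for label, ((best_v2, best_bc), (rate_v2, rate_bc)) in rows.items():
        result[label] = (round(best_bc - best_v2, 2), rate_v2 - rate_bc)
    return result


for label, (time_gap_s, rate_gap_pts) in gaps(checkpoints).items():
    print(f"{label}: v3.1_bc slower by {time_gap_s:.2f}s, finish rate lower by {rate_gap_pts} pts")
```

The best-time gap shrinks from 1.03 s at 2 h to 0.11 s at 8 h and stays at 0.11 s at the end of the common window, while the finish-rate gap narrows from 19 to 8 to 6 points.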

By-step comparisons use a common window up to 100M steps. Embedded figures use cumulative training hours on X (generate_experiment_plots.py default auto).

Detailed TensorBoard Metrics Analysis

Methodology — By time vs by steps: Regenerate the BY TIME grid with explicit training-hours axis:

python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc --time-axis cumul_training_hours --interval-training-hours 1 --step_interval 5000000

BY STEP uses the same command (step grid unchanged). Do not use wall-minute checkpoints for this pair: wall span >2× active training time (Time axis conventions (experiment write-ups)).

Figures: python scripts/generate_experiment_plots.py --experiments multi_offset_v2_vs_v31bc_pretrained

A01 eval — best time (`Race/eval_race_time_trained_A01`)

By cumulative training hours: see the 2 h / 8 h / ~17.7 h rows in the section above (recomputed from merged TensorBoard).

By steps: at 20M — 24.46s vs 24.57s; at 100M — 24.15s vs 24.26s.

A01 eval best race time by cumulative training hours (A01_as20_long_v2 vs A01_as20_long_v3.1_pretrained_bc)

A01 eval — mean time (all episodes; includes DNF / cutoff races)

Mean time is dominated by non-finished runs; use it as a stability / typical-episode signal alongside best time.
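
A small worked example shows why the mean reacts so strongly to finish rate. All inputs here are hypothetical placeholders chosen for illustration: a 24.5 s finished-lap time and a 300 s time recorded for a non-finished episode are assumptions, not values read from the run configs or logs.

```python
def mean_eval_time(finish_rate: float, finished_time_s: float, dnf_time_s: float) -> float:
    """Mean over all episodes when DNF episodes are logged at dnf_time_s."""
    return finish_rate * finished_time_s + (1.0 - finish_rate) * dnf_time_s


FINISHED_S = 24.5   # hypothetical typical finished time
DNF_S = 300.0       # hypothetical time recorded for a DNF / cutoff episode

for rate in (0.80, 0.74):
    print(f"finish rate {rate:.0%}: mean ~ {mean_eval_time(rate, FINISHED_S, DNF_S):.1f}s")
```

Under these assumed values, a 6-point drop in finish rate moves the mean from about 79.6 s to about 96.1 s, even though the finished-lap time is unchanged.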

By cumulative training hours: mean-time checkpoints are in the same BY TIME table output as for best time (see command above).

By steps: at 100M — 80.69s vs 98.50s mean eval time; eval finish rate 80% vs 74%.

A01 eval mean race time by cumulative training hours (A01_as20_long_v2 vs A01_as20_long_v3.1_pretrained_bc)

Configuration differences (this pair)

A01_as20_long_v2 (save/A01_as20_long_v2/config_snapshot.yaml): single-action setup (no rl_action_offsets_ms list in snapshot; n_prev_actions_in_inputs: 5); training.global_schedule_speed: 4; training.pretrain_bc_heads_path: null.

A01_as20_long_v3.1_pretrained_bc (save/A01_as20_long_v3.1_pretrained_bc/config_snapshot.yaml): environment.rl_action_offsets_ms: [0, 10, 20, 30, 40]; n_prev_actions_in_inputs: 25; exploration.multi_action_exploration: per_block; training.global_schedule_speed: 4; training.pretrain_bc_heads_path: output/ptretrain/bc/v5_multi_offset.

Configuration Changes

The runs share the same multi-action offsets:

environment.rl_action_offsets_ms = [0, 10, 20, 30, 40] (N=5 actions per decision block; applied on 10 ms rollout cadence).
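
The timing implied by this offsets list can be derived in a few lines. This is a sketch under the assumption that offsets are contiguous multiples of the 10 ms rollout step; the function name `block_timing` is hypothetical.

```python
def block_timing(offsets_ms, step_ms=10):
    """Return (actions per block, block duration in ms, decisions per second)."""
    n = len(offsets_ms)
    expected = [i * step_ms for i in range(n)]
    if list(offsets_ms) != expected:
        # Assumption of this sketch: offsets are 0, step, 2*step, ...
        raise ValueError(f"expected contiguous offsets {expected}, got {list(offsets_ms)}")
    block_ms = n * step_ms
    return n, block_ms, 1000.0 / block_ms


n_actions, block_ms, decisions_per_s = block_timing([0, 10, 20, 30, 40])
print(n_actions, block_ms, decisions_per_s)
```

For the list above this gives 5 actions per block, a 50 ms block, and 20 policy decisions per second, compared with 100 environment steps per second at the 10 ms cadence.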

Differences:

A01_as20_long_v3:
- training.global_schedule_speed = 1
- exploration.multi_action_exploration uses default (per_action)
- training.pretrain_bc_heads_path = null
A01_as20_long_v3.1:
- training.global_schedule_speed = 4
- exploration.multi_action_exploration = per_block
- training.pretrain_bc_heads_path = null
A01_as20_long_v3.1_pretrained_bc:
- training.global_schedule_speed = 4
- exploration.multi_action_exploration = per_block
- training.pretrain_bc_heads_path = output/ptretrain/bc/v5_multi_offset
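
The same differences can be expressed as data and diffed programmatically. The flattened dotted-key dictionaries and the `config_diff` helper below are illustrative only and are not how the project stores or loads its config snapshots; the values are copied from the list above, with null written as Python `None`.

```python
RUN_CONFIGS = {
    "A01_as20_long_v3": {
        "training.global_schedule_speed": 1,
        "exploration.multi_action_exploration": "per_action",  # default
        "training.pretrain_bc_heads_path": None,
    },
    "A01_as20_long_v3.1": {
        "training.global_schedule_speed": 4,
        "exploration.multi_action_exploration": "per_block",
        "training.pretrain_bc_heads_path": None,
    },
    "A01_as20_long_v3.1_pretrained_bc": {
        "training.global_schedule_speed": 4,
        "exploration.multi_action_exploration": "per_block",
        "training.pretrain_bc_heads_path": "output/ptretrain/bc/v5_multi_offset",
    },
}


def config_diff(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


print(config_diff(RUN_CONFIGS["A01_as20_long_v3"], RUN_CONFIGS["A01_as20_long_v3.1"]))
print(config_diff(RUN_CONFIGS["A01_as20_long_v3.1"], RUN_CONFIGS["A01_as20_long_v3.1_pretrained_bc"]))
```

The first diff contains two keys (schedule speed and exploration mode), so the v3 → v3.1 comparison changes two factors at once; the second diff isolates the BC head initialization.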

Hardware

GPU: not extracted here (see individual run logs).
Parallel instances: gpu_collectors_count = 8 for the v3.1_pretrained_bc run.

Conclusions

Multi-action offset training works and is sensitive to schedule speed and exploration granularity:
- Going from v3 to v3.1 (faster schedule + per_block exploration) improves early learning and reaches the ~24.8-24.5 s range quickly.
Pretraining the RL heads from the multi-offset BC run (v5_multi_offset) provides a durable benefit over v3.1 without pretraining:
- Better (lower) best time and higher finish rate at the same step levels (20M to 80M).
- Best time improvements continue through tens of millions of steps (see BY STEP rows above).
- Likely reason: better temporal/action mapping between pretrain and RL. In BC pretrain, the model predicts 5 offset actions at 10 ms spacing; in multi-action RL, one decision is taken every ~50 ms and outputs a 5-action block. This alignment is much closer than in the older single-action RL setup (one action every ~50 ms), where pretrain-to-RL mapping was weak.
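Against the longest existing baseline (A01_as20_long, trained up to ~150M steps), step overlap is limited to ~19.2M steps. At that point the offset+pretrained agent is slightly behind on best time (24.570s vs 24.510s) and on finish rate (59% vs 71%); its final logged best of 24.26s comes from later in the run, beyond the shared window.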
Against A01_as20_long_v2 (single-action, same global_schedule_speed = 4), multi-offset + BC heads (v3.1_pretrained_bc) still trails on best eval time, mean eval time, and eval finish rate, both on the shared cumulative-training-hours window (figures / BY TIME table) and at 100M steps. Saved bests are 24.150s vs 24.260s (see Direct comparison: v2 vs v3.1_pretrained_bc).

Recommendations

If you adopt multi-action offsets, try global_schedule_speed = 4 and multi_action_exploration = per_block as the first tuning pair.
If you can afford it, initialize from a multi-offset BC run (here: v5_multi_offset) to improve long-run finish rate and to lower the achievable best time at large step counts.
For comparisons against other “longest run” baselines, expect limited step overlap if logging stops at different ranges; use cumulative-training-hours curves from the analysis script or embedded plots rather than wall-minute curves.

Analysis Tools

Compare v3 variants (default auto time axis + by steps; recommended: --step_interval 1000000):

python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v3 A01_as20_long_v3.1 A01_as20_long_v3.1_pretrained_bc --interval-training-hours 0.5 --step_interval 1000000
Compare against the longest baseline A01_as20_long:

python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v3.1_pretrained_bc A01_as20_long --interval-training-hours 0.5 --step_interval 1000000
Compare v2 vs v3.1_pretrained_bc (BY TIME + BY STEP): python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc --interval-training-hours 0.5 --step_interval 5000000
Plots for docs: python scripts/generate_experiment_plots.py --experiments multi_offset_v2_vs_v31bc_pretrained. One-off JPG: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc --time-axis cumul_training_hours --interval-training-hours 0.5 --step_interval 5000000 --plot --output-dir docs/source/_static --prefix exp_multi_offset_v2_vs_v31bc_pretrained
Audit wall vs training time: python scripts/audit_tensorboard_training_timeline.py --runs A01_as20_long_v2 A01_as20_long_v3.1_pretrained_bc