Epsilon-Greedy Exploration

This experiment tests the effect of epsilon_schedule (epsilon-greedy exploration) on convergence and policy quality. Both runs use temporal_mini_race_duration_ms = 7000 (7 s); the only change is exploration: uni_15 uses higher epsilon (0.5 at 300k steps, 0.1 at end) vs uni_12 baseline (default: 0.1 at 300k, 0.03 at end).

Runs: uni_12 (baseline exploration, 7 s segment), uni_15 (increased exploration, 7 s segment). Comparison is by relative time over the common window up to 55 min.

Experiment Overview

We compared epsilon schedules: uni_12 (default: epsilon 0.1 at 300k steps, 0.03 at 3M) vs uni_15 (0.5 at 300k steps, 0.1 at 3M). uni_12 ran ~55 min; uni_15 ~160 min; comparison is by relative time over the common window up to 55 min. The primary change is more exploration in uni_15 (slower decay, higher final epsilon).
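
For context, epsilon-greedy action selection takes a uniformly random action with probability epsilon and the greedy (highest-Q) action otherwise; the schedule only controls how epsilon decays with training steps. A minimal sketch (the function name and plain-Python form are illustrative, not the project's actual implementation):

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniformly random action index.
        return rng.randrange(len(q_values))
    # Exploit: action with the highest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 this is purely greedy:
epsilon_greedy_action([0.1, 0.9, 0.3], 0.0)  # → 1
```

Higher epsilon for longer (uni_15) means more random actions deep into training, which is what slows early convergence in the results below.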

Results

Important: Findings are by relative time (minutes from run start). Common window up to 55 min (uni_12 ended at 55 min; uni_15 ran ~160 min); metrics are compared at the same checkpoints (5, 10, …, 55 min).

Data source: Numbers from scripts/analyze_experiment_by_relative_time.py (per-race tables: Hock = long track ~55–70 s, A01 = short track ~24–25 s). Reproduce: python scripts/analyze_experiment_by_relative_time.py uni_12 uni_15 --interval 5 (--logdir "<path>" if needed).

Key findings (uni_12 vs uni_15):

  • uni_12 (default exploration) converges faster: A01 (eval) reaches 24.85s by 20 min and holds; uni_15 is at 33.98s by 10 min, 27.03s by 40 min, and 26.73s at 55 min. At 55 min, uni_12 24.85s vs uni_15 26.73s → uni_12 better.

  • Hock (explo): uni_12 69.61s by 20 min, 61.68s at 55 min; uni_15 first Hock finish only at ~44 min (85.89s), at 55 min 83.75s → uni_12 much better on Hock over the common window.

  • Training loss at 55 min: uni_12 102.84, uni_15 144.12 → uni_12 lower (better).

  • RL/avg_Q_trained_A01 at 55 min: uni_12 -0.71, uni_15 -1.27 → uni_12 better (less negative).

  • GPU utilization similar (~71–74% uni_12, ~68–70% uni_15).

Conclusion: Over the common 55 min window, increased exploration (uni_15) slows convergence: uni_12 reaches better A01 and Hock times earlier and has lower loss and better Q at 55 min. The higher epsilon schedule (0.5 at 300k, 0.1 at end) did not help within 55 min; default (0.1 at 300k, 0.03 at end) is better for this horizon.

Run Analysis

  • uni_12: epsilon_schedule default (0.1 at 300k, 0.03 at 3M), temporal_mini_race_duration_ms = 7000, ~55 min

  • uni_15: epsilon_schedule (0.5 at 300k, 0.1 at 3M), temporal_mini_race_duration_ms = 7000, ~160 min

TensorBoard logs: tensorboard\uni_12, tensorboard\uni_15. Reproduce: python scripts/analyze_experiment_by_relative_time.py uni_12 uni_15 --interval 5 (--logdir "<path>" if not from project root).

Detailed TensorBoard Metrics Analysis

Methodology — Relative time: Metrics are taken at checkpoints 5, 10, 15, …, 55 min within the common window up to 55 min. Race times come from the per-race tables; loss, Q, and GPU% are the last logged values at or before each checkpoint. The figures below show one metric per graph (runs as lines, by relative time).
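
The "last value at or before the checkpoint" rule can be sketched as follows (hypothetical helper; assumes the scalars have already been read from the TensorBoard event files as (minutes_from_start, value) pairs sorted by time):

```python
def value_at_checkpoint(samples, checkpoint_min):
    """Return the last value logged at or before the checkpoint (None if none yet)."""
    eligible = [value for minute, value in samples if minute <= checkpoint_min]
    return eligible[-1] if eligible else None

# Illustrative (made-up) loss samples, not the runs' real data:
loss = [(3, 110.0), (18, 105.0), (52, 102.84)]
[value_at_checkpoint(loss, m) for m in (5, 20, 55)]  # → [110.0, 105.0, 102.84]
```

This also explains why a metric can be missing at early checkpoints (e.g. uni_15's first Hock finish at ~44 min): before the first logged sample, the helper returns None.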

A01 (per-race eval_race_time_trained_A01)

  • uni_12: at 20 min 24.85s; at 55 min 24.85s.

  • uni_15: at 10 min 33.98s; at 40 min 27.03s; at 55 min 26.73s.

  • uni_12 reaches best A01 by 20 min and is better at 55 min (24.85s vs 26.73s).

A01 eval best time by relative time (uni_12 vs uni_15)

Hock (per-race explo_race_time_trained_hock)

  • uni_12: at 20 min 69.61s; at 55 min 61.68s.

  • uni_15: first finish at ~44 min (85.89s); at 55 min 83.75s.

  • uni_12 much better on Hock over the common window (61.68s vs 83.75s at 55 min).

Hock explo best time by relative time (uni_12 vs uni_15)

Training loss

  • uni_12: at 55 min 102.84.

  • uni_15: at 55 min 144.12; higher through most of the window.

  • uni_12 lower (better) at end of common window.

Training loss by relative time (uni_12 vs uni_15)

Average Q-values (RL/avg_Q_trained_A01)

  • uni_12: at 20 min -0.83; at 55 min -0.71.

  • uni_15: at 55 min -1.27; more negative over the run.

  • uni_12 better (less negative) at 55 min.

Avg Q by relative time (uni_12 vs uni_15)

GPU utilization (Performance/learner_percentage_training)

  • uni_12: ~71–74% over the window; at 55 min 71.9%.

  • uni_15: ~68–70% over the window; at 55 min 68.6%.

  • Similar; uni_12 slightly higher.

Configuration Changes

Exploration (exploration section in config YAML):

uni_12 (baseline — default):

epsilon_schedule = [
    (0, 1),
    (50_000, 1),
    (300_000, 0.1),   # default
    (3_000_000 * global_schedule_speed, 0.03),  # default
]

uni_15 (experimental — increased exploration):

epsilon_schedule = [
    (0, 1),
    (50_000, 1),
    (300_000, 0.5),   # 0.1 in default
    (3_000_000 * global_schedule_speed, 0.1),  # 0.03 in default
]
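
Between breakpoints, schedules like these are typically interpolated linearly in the step count (an assumption; the project may interpolate differently). A sketch with global_schedule_speed = 1:

```python
def epsilon_at(step, schedule):
    """Piecewise-linear interpolation of epsilon over (step, epsilon) breakpoints."""
    if step <= schedule[0][0]:
        return schedule[0][1]
    for (s0, e0), (s1, e1) in zip(schedule, schedule[1:]):
        if step <= s1:
            return e0 + (step - s0) / (s1 - s0) * (e1 - e0)
    return schedule[-1][1]  # past the last breakpoint: hold the final value

default = [(0, 1), (50_000, 1), (300_000, 0.1), (3_000_000, 0.03)]       # uni_12
experimental = [(0, 1), (50_000, 1), (300_000, 0.5), (3_000_000, 0.1)]   # uni_15

# Halfway between 50k and 300k steps, the default gives ~0.55, uni_15 ~0.75.
```

Under this reading, uni_15's agent keeps taking roughly 2–5x more random actions than uni_12's at every point after 50k steps, consistent with its slower convergence over the 55 min window.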

Environment: Both runs used temporal_mini_race_duration_ms = 7000 (same as baseline in temporal_mini_race_duration experiment).

Hardware

  • GPU: RTX 5090 (same as other experiments)

  • Parallel instances: 8 collectors

  • System: Same across runs

Conclusions

  1. Over the common 55 min window, uni_12 (default epsilon) outperforms uni_15 (increased exploration): faster convergence on A01 (24.85s by 20 min vs 26.73s at 55 min) and much better Hock (61.68s vs 83.75s at 55 min), lower loss (102.84 vs 144.12), and better Q (-0.71 vs -1.27).

  2. Higher epsilon schedule (0.5 at 300k, 0.1 at end) slowed convergence; default (0.1 at 300k, 0.03 at end) is preferable for the same wall-clock horizon.

  3. Recommendation: Keep default epsilon_schedule (0.1 at 300k, 0.03 at 3M) for fastest convergence. Use higher epsilon only if testing much longer runs or different exploration strategies.

Recommendations

  • Default: Prefer default epsilon_schedule (0.1 at 300k steps, 0.03 at 3M): faster convergence and better race times over 55 min vs increased exploration (uni_15).

  • When to try higher epsilon: Only for longer runs or dedicated exploration experiments; over 55 min, increased exploration did not improve results.

Analysis tools:

  • By relative time (2+ runs): python scripts/analyze_experiment_by_relative_time.py uni_12 uni_15 --interval 5 (--logdir "<path>" if not from project root).

  • Key metrics: Per-race Race/eval_race_time_*, Race/explo_race_time_*; scalars alltime_min_ms_hock, alltime_min_ms_A01, Training/loss, RL/avg_Q_trained_A01, Performance/learner_percentage_training (see tensorboard_metrics).