Batch Size and Running Speed ============================ Experiment Overview ------------------- This document covers experiments on **batch_size** and **running_speed**. Baseline: **uni_5** (batch 2048, speed 160×). **Direction 1 — Larger batch:** Increasing batch to 8192 (uni_6) **worsened** convergence and loss; RTX 5090 hit a performance ceiling. **Direction 2 — Smaller batch + faster speed:** Reducing batch to 512 and speed to 512× (uni_7) **improved** convergence, loss, and GPU utilization. Follow-up runs uni_8 and uni_9 (even smaller batch, higher speed) **reversed** those gains: over-tuning hurt. Conclusion: a **golden middle** between batch_size and running_speed is needed; uni_7 (512/512) is in that range. **Direction 3 — Map cycle length:** uni_11 and uni_12 kept the same batch/speed as uni_7 (512/512) but changed the map cycle. **uni_11**: 256 hock – 256 A01 — at 55 min uni_11 better Hock (60.69s vs 61.94s), uni_7 better A01; ambiguous. **uni_12**: 64 hock – 64 A01 — **uni_12 converges faster on A01** (24.85s by 20 min vs uni_7 by 40 min), lower loss and better Q over 55 min; at 55 min uni_7 slightly better A01, Hock close. uni_12 **clearly better** than uni_11 on A01. **Conclusion:** 64–64 is recommended; prefer it over 4–4 and over 256–256. Results ------- **Important:** Experiments had different durations (uni_5 ~160 min, uni_7 ~86 min, etc.), so comparing by “last value” is meaningless. All findings below are based on **relative time** — minutes from run start; metrics are compared at the same moments (5, 10, 20, … min), up to when the shortest compared run ends. **Data source:** All numbers below are from ``scripts/analyze_experiment_by_relative_time.py`` (per-race tables for race times: **Hock** = long track ~55–70 s, **A01** = short track ~24–25 s). Reproduce: ``python scripts/analyze_experiment_by_relative_time.py uni_5 uni_7 ... --interval 5`` (``--logdir `` if needed). **Key Findings (by relative time):** - **Larger batch (uni_6)**: At 160 min — worse best times (Hock 66.94s vs 58.70s for uni_5; A01 24.77s vs 24.56s), loss ~10× higher (2979 vs 301), GPU % no gain (~52% vs ~54%). Do not go above 2048. - **Smaller batch + faster speed (uni_7)**: uni_7 reaches good times faster (at 60 min Hock 60.58s vs uni_5 69.27s; at 85 min Hock 58.76s vs 65.04s; A01 tie 24.58s). Loss much lower, Q better, GPU ~78% vs ~53%. Best trade-off. - **Over-tuning (uni_8, uni_9)**: At 70 min — uni_7 best: Hock 60.58s vs 74.24s / 80.23s; A01 24.63s vs 25.18s / 25.25s. uni_8/9 have higher GPU % but worse policies. - **Fewer collectors (uni_10, gpu_collectors_count=4)**: At 85 min uni_7 ahead: Hock 58.76s vs 62.11s, A01 24.58s vs 24.72s. **Main point** — throughput and GPU %: ~59% vs ~79%. For maximum training speed, 8 collectors are better. - **Map cycle 256 hock – 256 A01 (uni_11 vs uni_7)**: At 55 min — uni_11 better Hock (60.69s vs 61.94s), uni_7 better A01 (24.72s vs 25.08s). Loss and Q close (uni_7 113, uni_11 106; Q -1.24 vs -1.10). GPU % uni_7 ~79% vs uni_11 ~70%. Interpretation ambiguous. - **Map cycle 64 hock – 64 A01 (uni_12 vs uni_7, uni_11)**: **uni_12 converges faster on A01** — 24.85s by 20 min (uni_7 24.95s only by 40 min). At 55 min: Hock uni_7 61.94s, uni_12 61.68s (close); A01 uni_7 24.72s, uni_12 24.85s (uni_7 slightly better). Loss and Q better for uni_12 (102.8 vs 113, -0.71 vs -1.24). Versus uni_11: **uni_12 better** on A01 and convergence. **Conclusion:** 64–64 is **recommended**: faster A01 convergence and better loss/Q than uni_7 over 55 min; clearly better than 256–256 on A01. Run Analysis ------------ All runs use the same hardware (RTX 5090). uni_5–uni_9: 8 collectors; uni_10, uni_11, uni_12: 8 collectors. TensorBoard logs: ``tensorboard\uni_``. **Durations (relative time):** uni_5 ~160 min, uni_6 ~165 min, uni_7 ~86 min, uni_8 ~76 min, uni_9 ~194 min, uni_10 ~71 min, uni_11 ~55 min, uni_12 ~55 min. - **uni_5**: Baseline — batch_size = 2048, running_speed = 160, ~160 min - **uni_6**: Larger batch 8192, speed 160, ~165 min — by relative time worse Hock and loss - **uni_6_2**: Continuation of uni_6 (same config) - **uni_7**: Smaller batch 512, speed 512, 8 collectors, map cycle **4 hock – 4 A01** (+ 1 eval A01 at end), ~86 min — **best trade-off** by relative time - **uni_8**: Further smaller batch + higher speed, ~76 min — worse than uni_7 on times and Q on common window - **uni_9**: Extreme (e.g. batch 64, speed 1024), ~194 min — worse than uni_7 on common window up to 70 min - **uni_10**: Same as uni_7 (batch 512, speed 512) but **gpu_collectors_count = 4**, ~71 min — sometimes slightly faster on time early on, by 70 min slightly worse; **throughput and GPU % lower** (~59% vs ~79%). - **uni_11**: Same as uni_7 (batch 512, speed 512, 8 collectors) but map cycle **256 hock – 256 A01** (+ 1 eval A01 at end), ~55 min — by relative time slightly more events than uni_7; metrics close, uni_7 slightly better on times and GPU % over common window up to 55 min. - **uni_12**: Same as uni_7 (batch 512, speed 512, 8 collectors) but map cycle **64 hock – 64 A01** (+ 1 eval A01 at end), ~55 min — by relative time **faster convergence**, equal or better A01 and loss/Q than uni_7 over 55 min; **clearly better** than uni_11 (256–256). **Recommended** map cycle length for this setup. To reproduce metrics by **relative time**: ``python scripts/analyze_experiment_by_relative_time.py uni_5 uni_7 --interval 5`` (or ``uni_5 uni_6 uni_7 uni_10 uni_11 uni_12`` for all runs; ``--logdir ""`` if not from project root). For last-value comparison: ``python scripts/analyze_experiment.py uni_5 uni_6 uni_7 ...``. Analysis methodology ~~~~~~~~~~~~~~~~~~~~ The script ``analyze_experiment_by_relative_time.py`` supports **two or more runs** (e.g. ``uni_5 uni_6 uni_7 uni_10 uni_11 uni_12``). It compares by **relative time** (minutes from run start); checkpoints are 5, 10, 15, … min up to the shortest run. - **Race times (preferred):** Uses per-race events ``Race/eval_race_time_*`` and ``Race/explo_race_time_*``. For each run, one **run-wide t0** (min wall_time across race tags) is used so "5 min" is the same moment for all tags. At each checkpoint T min: **best** = min of race times with rel_min ≤ T, **mean**, **std** (stability), **best_fin** (best among finished only), **finish rate**, **first finish** (minute). This gives more dynamics and stability info than scalar metrics alone. - **Scalar metrics:** ``alltime_min_ms_{map}``, ``Training/loss``, ``RL/avg_Q_*``, ``Performance/learner_percentage_training`` at each checkpoint (best so far for race times, last value for loss/Q/GPU). - Race times in the script output use **per-race tables** (``Race/eval_race_time_*``, ``Race/explo_race_time_*``); Hock is the long track (~55–70 s), A01 the short track (~24–25 s). Detailed TensorBoard Metrics Analysis ------------------------------------- Metrics below are from TensorBoard logs (``tensorboard\uni_``). Baseline is uni_5 (2048 batch, 160 speed). The figures below show comparison plots (one metric per graph, runs as lines, by relative time) for the main comparisons. **Methodology — Relative time:** Experiments had different durations (see Run Analysis), so comparing by “last value” is invalid. Metrics are aligned by **relative time** — minutes from run start. Values are taken at checkpoints 5, 10, 15, 20, … min; comparison runs only until the shortest compared run is still going. Race times: use **per-race events** (script prints best/mean/std/finish rate at each checkpoint) or scalar ``alltime_min_ms_*`` (best so far at that moment). Loss / Q / GPU %: **last value at that moment**. Tables: ``python scripts/analyze_experiment_by_relative_time.py [ ...] [--interval 5]``. **Key metrics** (aligned with :doc:`tensorboard_metrics`): Per-race: ``Race/eval_race_time_*``, ``Race/explo_race_time_*`` (best/mean/std, finish rate). Scalars: ``alltime_min_ms_{map}``, ``Training/loss``, ``RL/avg_Q_*``, ``Performance/learner_percentage_training``. Also: ``Performance/transitions_learned_per_second``, ``Gradients/norm_before_clip_max``. For interpretation see that file. Larger batch: uni_5 vs uni_6 (relative time, common window up to 160 min) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Hock** (per-race explo): at 160 min — uni_5 58.70s, uni_6 66.94s → **uni_6 worse**. - **A01** (per-race eval): at 160 min — uni_5 24.56s, uni_6 24.77s → **uni_6 worse**. - **Training/loss**: at 160 min — uni_5 301, uni_6 2979 → **~10× higher** in uni_6. - **RL/avg_Q_trained_A01**: at 160 min — uni_5 -0.79, uni_6 -1.23 → uni_6 more negative. - **Performance/learner_percentage_training**: ~53% uni_5, ~52% uni_6 → **no gain**; performance ceiling. Smaller batch + faster speed: uni_5 vs uni_7 (relative time, common window up to 85 min) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Hock** (per-race explo): at 60 min — uni_5 69.27s, uni_7 60.58s; at 85 min — uni_5 65.04s, uni_7 58.76s → **uni_7 better**; reaches good Hock earlier. - **A01** (per-race eval): at 85 min — uni_5 24.58s, uni_7 24.58s (tie). - **Training/loss**: at 85 min — uni_5 355, uni_7 77.72 → **much lower** in uni_7. - **RL/avg_Q_trained_A01**: at 85 min — uni_5 -0.60, uni_7 -0.41 → **better** in uni_7. - **Performance/learner_percentage_training**: ~53% uni_5, ~78% uni_7 → **+25%** for uni_7. .. image:: ../_static/exp_training_speed_uni5_uni7_A01_best.jpg :alt: A01 best time by relative time (uni_5 vs uni_7) .. image:: ../_static/exp_training_speed_uni5_uni7_loss.jpg :alt: Training loss by relative time (uni_5 vs uni_7) Over-tuning: uni_7 vs uni_8 vs uni_9 (relative time, common window up to 70 min) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Hock** (per-race explo): at 70 min — uni_7 60.58s, uni_8 74.24s, uni_9 80.23s → **uni_7 better**. - **A01** (per-race eval): at 70 min — uni_7 24.63s, uni_8 25.18s, uni_9 25.25s → **uni_7 better**. - **Training/loss**: at 70 min — uni_7 92.4, uni_8 53.4, uni_9 45.4; uni_8/9 have lower raw loss but **race times and Q worse** ⇒ loss alone misleading. - **RL/avg_Q_trained_A01**: at 70 min — uni_7 -0.85, uni_8 -1.37, uni_9 -2.31 → **uni_7 better**. - **Performance/learner_percentage_training**: uni_7 ~79%, uni_8 ~80%, uni_9 ~82% — higher % does not yield better policies. uni_7 vs uni_10 (gpu_collectors_count, relative time, common window up to 85 min) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ uni_10 — same batch_size 512 and running_speed 512 as uni_7, but **gpu_collectors_count = 4** instead of 8. - **Hock** (per-race explo): at 70 min — uni_7 60.58s, uni_10 62.11s; at 85 min — uni_7 58.76s, uni_10 62.11s → **uni_7 better** by end of common window. - **A01** (per-race eval): at 85 min — uni_7 24.58s, uni_10 24.72s → **uni_7 slightly better**. - **Training/loss**: at 85 min — uni_7 77.72, uni_10 100.69 — close over the window. - **RL/avg_Q_trained_A01**: at 85 min — uni_7 -0.41, uni_10 -0.65 — close. - **Performance/learner_percentage_training**: uni_7 ~79%, uni_10 ~59% → **throughput and GPU % substantially lower** for uni_10. **Conclusion for uni_10:** By relative time uni_7 is ahead on Hock and A01 by end of common window. Main difference — **fewer ops per unit time** and lower GPU %; for maximum training speed, 8 collectors are better. uni_7 vs uni_11 (map cycle 4–4 vs 256–256, relative time, common window up to 55 min) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ uni_11 — same batch_size 512 and running_speed 512 as uni_7, but map cycle **256 hock – 256 A01** (+ 1 eval on A01 at the end) instead of **4 hock – 4 A01**. uni_11 ~55 min; uni_7 ~86 min. - **Hock** (per-race explo): at 15 min — uni_7 89.48s, uni_11 64.58s (uni_11 faster early); at 55 min — uni_7 61.94s, uni_11 60.69s → **uni_11 slightly better** Hock at end of window. - **A01** (per-race eval): at 55 min — uni_7 24.72s, uni_11 25.08s → **uni_7 better** on A01. - **Training/loss**: at 55 min — uni_7 113.0, uni_11 106.0 — **close**. - **RL/avg_Q_trained_A01**: at 55 min — uni_7 -1.24, uni_11 -1.10 — **close**. - **Performance/learner_percentage_training**: uni_7 ~79% vs uni_11 ~70% → **uni_7 higher GPU %**. **Conclusion for uni_11:** At 55 min uni_11 has slightly better Hock (60.69s vs 61.94s), uni_7 better A01 (24.72s vs 25.08s). **Interpretation ambiguous** — 256–256 is not clearly better or worse; more data or longer runs would help. uni_7 vs uni_12 (map cycle 4–4 vs 64–64, relative time, common window up to 55 min) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ uni_12 — same batch_size 512 and running_speed 512 as uni_7, but map cycle **64 hock – 64 A01** (+ 1 eval on A01 at the end) instead of **4 hock – 4 A01**. uni_12 ~55 min; uni_7 ~86 min. - **Hock** (per-race explo): at 55 min — uni_7 61.94s, uni_12 61.68s → **close** (uni_12 slightly better). uni_12 gets first Hock finish later (19.4 min vs 11.1 min for uni_7) but reaches 61.68s by 55 min. - **A01** (per-race eval): **uni_12 converges faster** — 24.85s by 20 min (uni_7 24.95s only by 40 min). At 55 min — uni_7 24.72s, uni_12 24.85s → **uni_7 slightly better** at end of window; uni_12 **much faster** to good A01. - **Training/loss**: at 55 min — uni_7 113.0, uni_12 102.8 → **uni_12 lower** (better). - **RL/avg_Q_trained_A01**: at 55 min — uni_7 -1.24, uni_12 -0.71 → **uni_12 better** (less negative); at 20 min uni_12 -0.83 vs uni_7 -1.44. - **Performance/learner_percentage_training**: uni_7 ~79% vs uni_12 ~72% → uni_7 higher GPU %; uni_12 has **faster A01 convergence and better loss/Q**. **Conclusion for uni_12:** Over 55 min **uni_12 converges faster on A01** (24.85s by 20 min vs uni_7 by 40 min) and has **lower loss and better Q**. At 55 min uni_7 is slightly better on A01 (24.72s vs 24.85s) and Hock is close (61.94s vs 61.68s). Versus uni_11 (256–256), uni_12 is **clearly better** on A01 and convergence. **Recommendation:** Prefer **64 hock – 64 A01** over 4–4 and over 256–256 for this setup. .. image:: ../_static/exp_training_speed_uni7_uni12_A01_best.jpg :alt: A01 best time by relative time (uni_7 vs uni_12) .. image:: ../_static/exp_training_speed_uni7_uni12_loss.jpg :alt: Training loss by relative time (uni_7 vs uni_12) Other metrics (from :doc:`tensorboard_metrics`) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When comparing batch/speed setups, also check in TensorBoard or via ``scripts/extract_tensorboard_data.py``: - **Performance/transitions_learned_per_second** — training throughput; higher is better; reflects efficiency of the pipeline. - **Gradients/norm_before_clip_max** — training stability; spikes >100 indicate gradient explosion; should stay relatively stable. Compare runs to ensure no setup introduces instability. Full interpretation and “what to watch for” for all metrics: :doc:`tensorboard_metrics`. Configuration Changes ---------------------- **Training** (``training`` section): batch_size was varied (2048 baseline, 8192 in uni_6, 512 in uni_7, smaller in uni_8/uni_9). **Recommended:** 512. **Performance** (``performance`` section): running_speed was varied (160 baseline, 512 in uni_7, higher in uni_8/uni_9). **Recommended:** 512. **gpu_collectors_count**: uni_10 used 4 with same 512/512 — throughput and GPU % drop; for maximum training speed keep 8. **Map cycle** (``map_cycle`` section): uni_7 used **4 hock – 4 A01** (+ 1 eval on A01 at the end); uni_11 **256 hock – 256 A01**; uni_12 **64 hock – 64 A01** (+ 1 eval on A01 at the end). Same batch/speed as uni_7. By relative time **uni_12 (64–64)** converges faster and gives equal or better A01, loss, and Q over the common 55 min window than uni_7; clearly better than uni_11. **Recommended:** 64–64. .. code-block:: python # Recommended (uni_7-style) batch_size = 512 # not 2048, not 8192, not 64 running_speed = 512 # not 160, not 1024 gpu_collectors_count = 8 # 4 (uni_10) yields fewer ops/sec and lower GPU % Hardware -------- - **GPU**: RTX 5090 - **Parallel instances**: 8 collectors (uni_5–uni_9, uni_11, uni_12); uni_10 used 4 — less throughput - **System**: Same across all runs Conclusions ----------- Conclusions are given by **relative time** (minutes from run start); comparing by “last value” when runs have different duration is invalid. 1. **Larger batch (8192) hurts:** On the common window by relative time — worse best times, ~10× higher loss, GPU % no gain (~52%). Keep batch ≤ 2048 or use 512 for the “small batch + fast speed” regime. 2. **Smaller batch (512) + faster speed (512×) helps:** At the same minutes from start uni_7 reaches good times faster, much lower loss, better Q, ~78% GPU. Best observed trade-off. 3. **Over-tuning (smaller batch + higher speed) hurts again:** On the common window up to 70 min uni_8/uni_9 do not reach uni_7-level times and Q. **Golden middle** — uni_7 (512/512); extremes (e.g. 64/1024) degrade quality. 4. **Trade-offs:** Smaller batches need more frequent sync; very high speed can affect physics; going beyond the sweet spot (batch/speed) degrades performance. 5. **gpu_collectors_count (uni_10):** By relative time quality is close to uni_7; uni_10 sometimes slightly faster early on. **Main point** — drop in throughput and GPU % (~59% vs ~79%); for maximum training speed by wall clock, 8 collectors are better. 6. **Map cycle length (uni_11, uni_12):** **256–256 (uni_11):** At 55 min uni_11 better Hock (60.69s vs 61.94s), uni_7 better A01 (24.72s vs 25.08s); ambiguous. **64–64 (uni_12):** **uni_12 converges faster on A01** (24.85s by 20 min vs uni_7 by 40 min), lower loss and better Q over 55 min; at 55 min uni_7 slightly better A01 (24.72s vs 24.85s), Hock close (61.94s vs 61.68s). uni_12 **clearly better** than uni_11 on A01. **Recommendation:** Use **64 hock – 64 A01** over 4–4 and over 256–256. Recommendations --------------- - **Use batch_size = 512** and **running_speed = 512** as the default “golden middle” for this setup (IQN, RTX 5090). - **Use gpu_collectors_count = 8** for maximum throughput; 4 collectors (uni_10) reduce ops per unit time and GPU share on training (~59% vs ~78%). - **Avoid** very small batch (e.g. 64) with very high speed (e.g. 1024); it slows training and convergence. - **Do not** increase batch beyond 2048 (or 512 if using the fast-speed regime) without evidence of benefit. - **Monitor** training loss, map times, and Q-values; GPU % alone is not sufficient. - **Tune** batch_size and running_speed together; uni_7 (512/512) is a proven working point. - **Map cycle:** Prefer **64 hock – 64 A01** (uni_12): by relative time it converges faster on A01 and has lower loss and better Q than 4–4 (uni_7) over 55 min, and is clearly better than 256–256 (uni_11) on A01. Use 64–64 as the default map cycle for this setup. **When to change:** - Larger batches (1024–2048): if GPU memory or sync overhead is the bottleneck. - Slower speed (160–200): if physics or stability is critical. - Keep 512/512 unless you have a clear reason to move away. **Analysis Tools:** - Activate venv (Windows): ``.\.venv\Scripts\activate`` - **By relative time** (compare at the same minutes from start; **2+ runs**): ``python scripts/analyze_experiment_by_relative_time.py uni_5 uni_7 --interval 5`` or ``python scripts/analyze_experiment_by_relative_time.py uni_5 uni_6 uni_7 uni_10 uni_11 uni_12 --interval 5`` (``--logdir ""`` if not from project root). Output: per-race tables (best/mean/std, finish rate, first finish) then scalar metrics. - By “last value” (less meaningful when durations differ): ``python scripts/analyze_experiment.py uni_5 uni_6 uni_7 ...`` - ``scripts/extract_tensorboard_data.py`` — selective metrics (``Gradients/norm_before_clip_max``, ``Performance/transitions_learned_per_second``, etc.). - **Key metrics** (see :doc:`tensorboard_metrics`): Per-race ``Race/eval_race_time_*``, ``Race/explo_race_time_*``; scalars ``Training/loss``, ``alltime_min_ms_{map}``, ``RL/avg_Q_*``, ``Performance/learner_percentage_training``, ``Gradients/norm_before_clip_max`` (stability).