BC pretrain: training length & early stop (Level 1)

Experiment Overview

This experiment compares two BC (behavioral cloning) pretrain runs to answer: how long to train, and whether early stopping is beneficial. Both runs use the same data and architecture (backbone mode, encoder init from Level 0 visual), but:

v1: Stopped at 25 epochs (early stopping enabled in config, or 25-epoch cap). Lower final val_loss (1.038) at stop.
v1.1: No early stopping, lr=0.0005, full 50 epochs. Slightly higher final val_loss (1.119); train_acc a bit higher (0.699 vs 0.696).

Goal: understand if we can use early stopping to save time without hurting downstream RL, and how per-action accuracy (e.g. left, right, accel, brake) evolves. When using multi-offset BC (bc_time_offsets_ms with several values), per-offset accuracy is logged (e.g. train_acc_offset_ms_-10, val_acc_offset_ms_0, val_acc_offset_ms_10) in TensorBoard and metrics.csv for analysis in doc/exp.

A second experiment (v1.1 vs v1.2) compares image normalization: v1.2 uses IQN-style normalization (x - 0.5) / 0.5 at BC input (cache is [0,1]), while v1.1 uses default [0, 1]. Both use the same visual backbone (Level 0 vis/v1) trained without IQN normalization, so v1.2 tests whether normalizing inputs at BC time improves validation loss and generalization despite the backbone having been pretrained on [0,1].

A third experiment (full IQN-aligned chain v2) trains a new visual backbone with IQN normalization (vis v2), then runs BC with that backbone and IQN normalization (BC v2). This removes the distribution mismatch: both Level 0 and Level 1 use (x - 0.5) / 0.5. Comparison v1.2 vs v2 shows whether full alignment (vis + BC both with IQN norm) improves over “BC-only” IQN norm (v1.2 with vis/v1).

Why IQN normalization exists (RL) and why pretrain should match it

In the RL pipeline, image inputs to the IQN network are always normalized as (x - 128) / 128: raw pixels are uint8 [0, 255], and the first step in the learner (e.g. in buffer_utilities.py when building batches and in agents/iqn.py in the inference path) is (img - 128) / 128. So the network sees inputs in roughly [-1, 1] (zero-centered, bounded).

Why this is used in IQN:

Zero-centering: Input mean ~0 (mid-gray 128 -> 0). The first layer does not need to compensate for a constant positive bias; gradients and learning are more stable.
Bounded symmetric range: Many architectures and initializations assume inputs in a bounded range; [-1, 1] is standard and avoids saturation at 0 or 255.

Why use the same convention in pretrain/BC:

When we transfer a pretrained encoder into IQN, the encoder is placed inside the same forward path: it will receive (x - 128) / 128 at RL time. If we trained the encoder on [0, 1] (e.g. v1.1: cache /255, no extra transform), then at transfer the encoder would see roughly [-1, 1], which is a distribution shift. The network was optimized for one input distribution and is then fed another, which hurts generalization (and can hurt RL sample efficiency). If we train BC (and optionally the visual backbone) with IQN-style normalization, we use (x - 0.5) / 0.5 on the [0, 1] cache. With the cache equal to x / 255, this gives 2x/255 - 1, which differs from (x - 128) / 128 by at most 1/128 (about 0.008) for x in [0, 255], so both conventions cover essentially the same [-1, 1] range. At transfer there is then no meaningful distribution shift: the encoder sees practically the same input distribution it was trained on.
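As a quick numeric check of the paragraph above, the following sketch evaluates both conventions on all 256 pixel values. The function names are illustrative and not taken from the repository; only the two formulas and the /255 cache scaling come from the text.

```python
# Compare RL-side normalization (raw uint8 input) with the BC-side "iqn"
# normalization applied to a [0, 1] cache (pixel / 255).
# Function names are hypothetical; only the formulas come from the text.

def rl_norm(pixel_uint8):
    """(x - 128) / 128, as applied to raw uint8 pixels in the RL path."""
    return (pixel_uint8 - 128) / 128


def bc_iqn_norm(cached_value):
    """(x - 0.5) / 0.5, as applied to the [0, 1] cache in BC pretrain."""
    return (cached_value - 0.5) / 0.5


rl_values = [rl_norm(p) for p in range(256)]
bc_values = [bc_iqn_norm(p / 255) for p in range(256)]

# Largest per-pixel disagreement between the two conventions.
max_gap = max(abs(a - b) for a, b in zip(rl_values, bc_values))

print(f"RL range: [{min(rl_values):.4f}, {max(rl_values):.4f}]")
print(f"BC range: [{min(bc_values):.4f}, {max(bc_values):.4f}]")
print(f"max gap:  {max_gap:.4f}")
```

The gap grows linearly with the pixel value and peaks at x = 255, where it equals 1/128. That is small next to the shift between a [0, 1] input and a [-1, 1] input.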

How the experiments align with this:

v1.1 (no IQN norm): Encoder and BC head trained on [0, 1]. At transfer to IQN they would see [-1, 1]. Shift -> worse generalization (val_loss 1.12, overfitting). Raw val_acc can still be high (0.625) because we are fitting the [0,1] training distribution well.
v1.2 (IQN norm at BC): Backbone gets fine-tuned on (x-0.5)/0.5 = [-1, 1] during BC; at transfer IQN feeds [-1, 1]. No shift -> better val_loss (0.95), better main_actions at best epoch.
v2 (full chain): Backbone and BC both trained on [-1, 1]. At transfer, same. No shift -> similar val_loss to v1.2; trade-offs are in which actions (e.g. coast) are predicted slightly better.

So the normalization in RL is there for training stability and a fixed input convention; using the same convention in pretrain removes distribution shift at transfer and matches the observed better generalization (v1.2/v2 vs v1.1).

Results

Important: v1 has 25 epochs (0–24), v1.1 has 50 epochs (0–49). Comparisons below are by epoch over the common window (0–24). At epoch 24, v1 is at its final checkpoint; v1.1 continues to epoch 49.

Key findings

Val loss: At epoch 24, v1 val_loss = 1.038, v1.1 val_loss = 1.045. v1 (early stop) has slightly better val_loss at the same epoch. By epoch 49, v1.1 val_loss = 1.119 (worse than v1 at 24), so training past ~25 epochs increases val_loss (overfitting).
Overall accuracy: At epoch 24, v1 val_acc ≈ 0.624, v1.1 ≈ 0.619. Final v1.1 (epoch 49) val_acc ≈ 0.625 — almost no gain from extra epochs.
Per-action accuracy (validation): Main predictive power is in accel, left+accel, right+accel, coast, left+accel+brake, right+accel+brake. Actions left, right, brake, left+brake, right+brake, accel+brake have near-zero or zero samples in validation, so val_acc_class_* is 0 or very small.
Combined (loss + per-action accuracy): The script reports best epoch by val_loss and at that epoch: val_loss, val_acc, and main_actions_val_acc (mean over the six actions above). For v1 the minimum val_loss is at epoch 14 (val_loss 0.997, main_actions_val_acc 0.49); for v1.1 at epoch 11 (val_loss 0.994, main_actions_val_acc 0.42). At later epochs (e.g. 24), overall val_acc and per-action accuracies are higher (e.g. v1 at 24: val_acc 0.62, accel/left+accel ~0.71, coast ~0.58), while val_loss is slightly higher. So: stopping at min val_loss gives the lowest loss but lower per-action accuracy; stopping around 20–25 epochs gives a better trade-off (good loss and higher accuracy on main actions). Prefer early stopping with patience 5–10 so the run stops after val_loss flattens but before overfitting, typically around 20–25 epochs.
Conclusion: Early stopping around 20–25 epochs is sufficient; training to 50 epochs (v1.1) does not improve val loss or val accuracy and leads to overfitting. Use early stopping (e.g. on val_loss with patience 5–10) to save time. When comparing runs, consider both val_loss and per-action accuracy (or main_actions_val_acc).
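To make the patience recommendation concrete, here is a minimal sketch of a patience-based stopping rule on val_loss. It is a generic illustration, not the logic used by pretrain_bc.py; the function name, the min_delta parameter, and the toy loss values are assumptions.

```python
def simulate_early_stopping(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a patience rule on val_loss.

    Training stops at the first epoch where val_loss has failed to improve
    on the best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses) - 1, best_epoch


# Toy curve (not real run metrics): minimum at epoch 3, then slow rise.
toy_val_losses = [1.07, 1.03, 1.01, 0.99, 1.00, 1.01, 1.02, 1.03, 1.05]
stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Under this rule, a run whose val_loss minimum is at epoch 14 stops at epoch 24 with patience 10, and a minimum at epoch 11 stops at epoch 21, as long as no later epoch improves on the minimum. Both fall in the 20–25 epoch window discussed above.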

Normalization (v1.1 vs v1.2):

Val loss: v1.2 (IQN normalization) achieves substantially lower final val_loss (0.949 vs 1.119) and better best-epoch val_loss (0.941 at epoch 38 vs 0.994 at epoch 11). Normalization helps generalization (lower validation loss).
Val accuracy: At last epoch, overall val_acc is slightly lower in v1.2 (0.618 vs 0.625); at best epoch by val_loss, v1.2 has higher main_actions_val_acc (0.52 vs 0.42). So loss improves clearly; accuracy is similar or better at the best checkpoint.
Per-action: v1.2 improves accel (0.80 vs 0.71) and coast (0.45 vs 0.41); left+accel and right+accel+brake are slightly lower at epoch 49. The main takeaway is better val_loss with IQN normalization.
Caveat: Both runs use a visual backbone (vis/v1) trained without IQN normalization. So there’s a distribution mismatch: the backbone saw [0,1] during Level 0 pretrain; in v1.2 the BC input is transformed to (x-0.5)/0.5 before the backbone. Despite this mismatch, IQN normalization at BC time helps: lower val_loss and better best-epoch metrics. For full alignment, the next step would be to pretrain the visual backbone with IQN normalization as well, then run BC with IQN normalization.

Full chain comparison (v1.2 vs v2) — which is best for action prediction accuracy?

Overall: v1.2 and v2 are very close (val_loss 0.948–0.949, val_acc 0.616–0.618). At best epoch by val_loss, v1.2 has higher main_actions_val_acc (0.517 vs 0.507), so v1.2 is marginally better for aggregate action prediction accuracy.
Per-action: v2 is clearly better for coast (0.64 vs 0.45); v1.2 is slightly better on accel, left+accel+brake. So: for best overall action accuracy use v1.2; for best coast use v2.

Three-way comparison (v1.1 vs v1.2 vs v2) — action prediction accuracy:

Overall val_acc (last epoch): v1.1 0.625 > v1.2 0.618 > v2 0.616. v1.1 has the highest raw val_acc at epoch 49.
Val_loss (generalization): v1.2 and v2 ~0.948 (best); v1.1 1.119 (worst). So v1.1 overfits more.
Best epoch by val_loss: v1.1 best at epoch 11 (val_loss 0.994, main_actions_val_acc 0.42); v1.2 and v2 best at epoch 38 (main_actions_val_acc 0.517 and 0.507). main_actions_val_acc (mean over accel, left+accel, right+accel, coast, left+accel+brake, right+accel+brake) is best for v1.2 at the best checkpoint.
Per-action (epoch 49): v1.1 leads on left+accel (0.70), right+accel (0.53), right+accel+brake (0.33); v1.2 leads on accel (0.80), left+accel+brake (0.42); v2 leads on coast (0.64). v1.1 has lowest coast (0.41).
Summary: For highest overall val_acc at the end of training, v1.1 wins (0.625) but has worst val_loss. For best balance of generalization and main-actions accuracy (best epoch), v1.2 is best. For best coast prediction, v2. Reproduce with: python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.1 output/ptretrain/bc/v1.2 output/ptretrain/bc/v2 --interval 5.

Run Analysis

v1: BC pretrain from output/ptretrain/vis/v1/encoder.pt, bc_mode=backbone, 25 epochs (stopped early or 25-epoch config). Val_loss final 1.038, val_acc 0.624. CSV: output/ptretrain/bc/v1/csv/metrics.csv (or csv/version_0/metrics.csv).
v1.1: Same encoder init and bc_mode, no early stopping, lr=0.0005, 50 epochs, image_normalization default [0,1]. Val_loss final 1.119, val_acc 0.625. CSV: output/ptretrain/bc/v1.1/csv/metrics.csv.
v1.2: Same as v1.1 (encoder from vis/v1, 50 epochs, no early stopping) but image_normalization: "iqn", i.e. (x - 0.5) / 0.5 on the [0,1] cache. Val_loss final 0.949, val_acc 0.618. CSV: output/ptretrain/bc/v1.2/csv/metrics.csv.
v2: Level 0 vis v2 (IQN norm, cache v0): 50 epochs, val_loss 0.207, encoder output/ptretrain/vis/v2/encoder.pt. Level 1 BC v2: encoder from vis/v2, IQN norm, cache v0; 50 epochs; val_loss 0.948, val_acc 0.616. CSV: output/ptretrain/bc/v2/csv/metrics.csv.
v2_next_tick: Same as v2 but bc_target: next_tick (predict action at next timestep). Config: config_files/pretrain/bc/pretrain_config_bc_v2_next_tick.yaml. Run: python scripts/pretrain_bc.py --config config_files/pretrain/bc/pretrain_config_bc_v2_next_tick.yaml.
v2_multi_offset: Same base as v2_next_tick; bc_time_offsets_ms: [-10, 0, 10, 100] and bc_offset_weights: [0.2, 1.0, 0.5, 0.3] (four heads: past 10 ms, current, +10 ms, +100 ms). Config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset.yaml. Run: python scripts/pretrain_bc.py --config config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset.yaml. CSV and TensorBoard include per-offset accuracy (e.g. val_acc_offset_ms_-10, val_acc_offset_ms_0, val_acc_offset_ms_10, val_acc_offset_ms_100) for doc/exp.
v2_multi_offset_ahead: Same as v2_multi_offset but use_actions_head: true (vis backbone + IQN A_head–style MLP heads only; no full IQN, no float head). Config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead.yaml. Run: .\.venv\Scripts\python.exe scripts/pretrain_bc.py --config config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead.yaml. Saves actions_head.pt (offset 0) for RL merge. See “Experiment: vis backbone + a_head only (v2_multi_offset_ahead vs v2_multi_offset)” for comparison.
v2_multi_offset_ahead_tuned: Same as v2_multi_offset_ahead with tuned hyperparameters: lr: 0.0002, weight_decay: 0.01 (AdamW), early_stopping: true, patience: 10. Config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead_tuned.yaml. Improves val_acc and per-offset accuracy vs untuned ahead; stops around epoch 26. See “Experiment: vis backbone + a_head only” for comparison.
v2_multi_offset_ahead_reg: Same as v2_multi_offset_ahead with reg-only tuning, no early stop: lr: 0.0001, weight_decay: 0.02, early_stopping: false, full 50 epochs. Config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead_reg.yaml. Reaches val_acc 0.592 at epoch 49, best val_loss at epoch 35; per-offset ~0.597–0.598. Use when you want full 50 epochs without early stopping.
v2_multi_offset_ahead_dropout: Same as v2_multi_offset_ahead_reg + dropout: 0.2 on features before action head. Config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead_dropout.yaml. Best A_head variant without inner dropout: val_acc 0.595, val_loss 1.989 (epoch 49); per-offset 0.603 / 0.600 / 0.600 / 0.578. Use for RL merge when you want best val without early stop.
v2_multi_offset_ahead_dropout_inner: Same as v2_multi_offset_ahead_dropout + action_head_dropout: 0.1 (dropout between the two Linear layers of A_head). Config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead_dropout_inner.yaml. Best A_head run: val_acc 0.597, val_loss 1.971; per-offset 0.605 / 0.603 / 0.602 / 0.579; main_actions_val_acc 0.476, coast 0.374. Recommended for RL merge.

Multi-offset analysis: For runs with bc_time_offsets_ms (e.g. v2_multi_offset), scripts/analyze_pretrain_bc.py prints a Per-offset validation accuracy block (last epoch and best epoch by val_loss). Compare v2_next_tick vs v2_multi_offset to see whether predicting several time offsets helps overall or per-offset accuracy. Example: python scripts/analyze_pretrain_bc.py --base-dir output/ptretrain/bc v2_next_tick v2_multi_offset --interval 5.

Detailed Metrics Analysis (by Epoch)

Methodology: Metrics are compared at epoch checkpoints (0, 5, 10, 15, 20, 24) over the common epoch window (0–24). Source: Lightning CSV; use scripts/analyze_pretrain_bc.py to reproduce. Per-action accuracy is reported with action names (accel, left+accel, right+accel, coast, left, right, brake, etc.), not only class indices.
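The checkpoint list (0, 5, 10, 15, 20, 24) is consistent with sampling every 5 epochs and always including the last epoch. A minimal sketch of that selection follows, assuming this is how --interval behaves; the script itself is not shown here, so the helper name and logic are hypothetical.

```python
def checkpoint_epochs(n_epochs, interval):
    """Epochs sampled every `interval`, plus the final epoch (0-indexed)."""
    if n_epochs <= 0:
        return []
    epochs = list(range(0, n_epochs, interval))
    last_epoch = n_epochs - 1
    if epochs[-1] != last_epoch:
        epochs.append(last_epoch)
    return epochs


print(checkpoint_epochs(25, 5))  # common window for v1 vs v1.1
print(checkpoint_epochs(50, 5))  # full 50-epoch runs
```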

Train / val loss and overall accuracy at checkpoints

v1: epoch 0 — train_loss 1.31, val_loss 1.07, val_acc 0.57; epoch 10 — 0.85 / 1.02 / 0.61; epoch 24 — 0.79 / 1.04 / 0.62.
v1.1: epoch 0 — 1.30 / 1.09 / 0.59; epoch 10 — 0.89 / 1.00 / 0.59; epoch 24 — 0.83 / 1.05 / 0.62.

At epoch 24, v1 has lower val_loss (1.04 vs 1.05) and similar val_acc. v1.1 trained to 50 epochs ends with higher val_loss (1.12).

Combined analysis (loss + per-action accuracy)

Best epoch by val_loss (script analyze_pretrain_bc.py):

v1: best epoch = 14 (val_loss 0.997, val_acc 0.61, main_actions_val_acc 0.49).
v1.1: best epoch = 11 (val_loss 0.994, val_acc 0.60, main_actions_val_acc 0.42).

At these epochs val_loss is minimal but main_actions_val_acc (mean over accel, left+accel, right+accel, coast, left+accel+brake, right+accel+brake) is lower than at epoch 20–24. So stopping purely at min val_loss yields the best loss but worse per-action accuracy; a better compromise is to stop around 20–25 epochs when both loss and per-action acc are good.
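For readers who want to locate the best epoch without the analysis script, the sketch below scans a Lightning-style metrics.csv for the row with the lowest val_loss. It assumes validation metrics are logged on their own rows with an epoch column and that train-only rows leave val_loss empty; the CSV content here is toy data, not values from any run.

```python
import csv
import io


def best_epoch_by_val_loss(csv_text):
    """Return {"epoch", "val_loss", "val_acc"} for the lowest val_loss row."""
    best = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row.get("val_loss"):
            continue  # train-only row: validation columns are empty
        val_loss = float(row["val_loss"])
        if best is None or val_loss < best["val_loss"]:
            best = {
                "epoch": int(row["epoch"]),
                "val_loss": val_loss,
                "val_acc": float(row["val_acc"]),
            }
    return best


# Toy CSV in the assumed layout (train and val metrics on separate rows).
toy_csv = """epoch,step,train_loss,val_loss,val_acc
0,99,1.31,,
0,99,,1.07,0.57
1,199,0.90,,
1,199,,0.99,0.60
2,299,0.80,,
2,299,,1.04,0.62
"""

best = best_epoch_by_val_loss(toy_csv)
print(best)
```

In the toy data the lowest val_loss and the highest val_acc fall on different epochs, which is the same tension described above between stopping at minimum loss and stopping later for better accuracy.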

Per-action validation accuracy (action names)

At last epoch (v1 at 24, v1.1 at 49):

accel (0): v1 0.71, v1.1 0.71
left+accel (1): v1 0.71, v1.1 0.70
right+accel (2): v1 0.52, v1.1 0.53
coast (3): v1 0.58, v1.1 0.41 (v1.1 drops with more training)
left (4), right (5), brake (6), left+brake (7), right+brake (8), accel+brake (9): near zero in both (very few or no val samples)
left+accel+brake (10): v1 0.39, v1.1 0.41
right+accel+brake (11): v1 0.33, v1.1 0.33

So the model learns accel, left+accel, right+accel, coast, and the two accel+brake turn actions; rare actions (brake-only, left/right without accel) stay at 0. Training longer (v1.1 to 50 epochs) does not improve these and can slightly hurt coast.
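The same six-action mean used for main_actions_val_acc can be applied to the last-epoch values listed above. This sketch uses the rounded two-decimal numbers from this section, so the results are approximate and will not exactly match what analyze_pretrain_bc.py prints; the variable and function names are illustrative.

```python
MAIN_ACTIONS = (
    "accel",
    "left+accel",
    "right+accel",
    "coast",
    "left+accel+brake",
    "right+accel+brake",
)

# Rounded last-epoch validation accuracies from the list above
# (v1 at epoch 24, v1.1 at epoch 49).
last_epoch_val_acc = {
    "v1": {
        "accel": 0.71, "left+accel": 0.71, "right+accel": 0.52,
        "coast": 0.58, "left+accel+brake": 0.39, "right+accel+brake": 0.33,
    },
    "v1.1": {
        "accel": 0.71, "left+accel": 0.70, "right+accel": 0.53,
        "coast": 0.41, "left+accel+brake": 0.41, "right+accel+brake": 0.33,
    },
}


def main_actions_mean(per_action_acc):
    """Unweighted mean accuracy over the six main actions."""
    return sum(per_action_acc[a] for a in MAIN_ACTIONS) / len(MAIN_ACTIONS)


means = {run: main_actions_mean(acc) for run, acc in last_epoch_val_acc.items()}
for run, value in means.items():
    print(f"{run}: {value:.3f}")
```

With these rounded inputs the mean is about 0.54 for v1 and about 0.515 for v1.1. The difference comes almost entirely from coast (0.58 vs 0.41); the other five actions move by at most 0.02.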

Per-action accuracy vs training (epoch)

The figure below shows validation accuracy for each action (accel, left+accel, right+accel, coast, etc.) over epochs for v1 and v1.1. Main actions (0, 1, 2, 3, 10, 11) show clear learning curves; rare actions (4–9) stay near zero.

Figure: Per-action validation accuracy vs epoch (v1 and v1.1, 12 actions).

Experiment: IQN-style image normalization (v1.1 vs v1.2)

Overview: v1.2 is identical to v1.1 except image_normalization: "iqn" (input (x - 0.5) / 0.5 on [0,1] cache). Both use the same visual backbone (vis/v1) trained without IQN normalization. Comparison is by epoch (same 50 epochs, same data and seed).

Metrics at epoch checkpoints (from analyze_pretrain_bc.py):

Epoch 0: v1.1 train_loss 1.30, val_loss 1.09, val_acc 0.59; v1.2 train_loss 1.82, val_loss 1.12, val_acc 0.57 — v1.2 starts with higher train loss (normalization changes input distribution; backbone was trained on [0,1]).
Epoch 20: v1.1 val_loss 1.02, val_acc 0.61; v1.2 val_loss 0.97, val_acc 0.61 — v1.2 pulls ahead on val_loss.
Epoch 49: v1.1 val_loss 1.12, val_acc 0.625; v1.2 val_loss 0.949, val_acc 0.618 — v1.2 has ~15% lower val_loss; val_acc nearly the same.

Best epoch by val_loss:

v1.1: best epoch = 11, val_loss = 0.994, main_actions_val_acc = 0.42.
v1.2: best epoch = 38, val_loss = 0.941, main_actions_val_acc = 0.52 — both better val_loss and better per-action accuracy at the best checkpoint.

Per-action validation accuracy (last epoch):

accel: v1.1 0.71, v1.2 0.80
left+accel: v1.1 0.70, v1.2 0.56
right+accel: v1.1 0.53, v1.2 0.52
coast: v1.1 0.41, v1.2 0.45
left+accel+brake / right+accel+brake: similar or slightly lower in v1.2 at epoch 49.

Conclusion (normalization): IQN-style normalization at BC input clearly reduces validation loss (~15% at 50 epochs) and gives a better best epoch (lower val_loss and higher main_actions_val_acc). Overall val_acc at the last epoch is slightly lower in v1.2, but the model generalizes better in terms of loss. Recommendation: use image_normalization: "iqn" for BC pretrain when the encoder will be loaded into IQN. Note: the visual backbone in both experiments was pretrained without IQN normalization; aligning Level 0 pretrain with IQN normalization in a future run may yield further gains.

Experiment: Multi-offset BC (v2_multi_offset)

Goal: Evaluate multi-offset BC (predict action at several time offsets from the last frame: -10 ms, 0 ms, +10 ms, +100 ms) to see whether auxiliary heads improve learning and whether per-offset accuracy is useful for analysis.

Config: Base = config_files/pretrain/bc/pretrain_config_bc_v2_next_tick.yaml (vis v2, IQN norm, next_tick). Minimal override: run_name: v2_multi_offset, bc_time_offsets_ms: [-10, 0, 10, 100], bc_offset_weights: [0.2, 1.0, 0.5, 0.3]. Full config: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset.yaml.

Cache: When bc_time_offsets_ms has more than one value, a full BC cache is built (train.npy, train_actions.npy with shape (N, n_offsets), val.*, cache_meta.json). After build, consistency is verified: train_actions.npy shape must match (n_train, len(bc_time_offsets_ms)) and meta must contain bc_time_offsets_ms, n_actions, n_train, n_val, source_signature. Re-running with the same offsets reuses the cache.
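The consistency rules described above can be expressed as a small check. This is a sketch under stated assumptions, not the repository's verification code: the function name is hypothetical, it takes the array shape and the parsed cache_meta.json as plain Python values, and the offset comparison is only an assumed way to implement the reuse condition.

```python
REQUIRED_META_KEYS = (
    "bc_time_offsets_ms",
    "n_actions",
    "n_train",
    "n_val",
    "source_signature",
)


def check_bc_cache(train_actions_shape, meta, offsets_ms):
    """Return a list of problems; an empty list means the cache looks consistent."""
    problems = []

    missing = [key for key in REQUIRED_META_KEYS if key not in meta]
    if missing:
        problems.append(f"cache_meta.json is missing keys: {missing}")

    # train_actions.npy must have shape (n_train, number of offsets).
    expected_shape = (meta.get("n_train"), len(offsets_ms))
    if tuple(train_actions_shape) != expected_shape:
        problems.append(
            f"train_actions shape {tuple(train_actions_shape)} != {expected_shape}"
        )

    # Assumption: a cache is reusable only if it was built for the same offsets.
    if list(meta.get("bc_time_offsets_ms", [])) != list(offsets_ms):
        problems.append("cached offsets differ from the configured offsets")

    return problems


# Example with made-up sizes and signature.
example_meta = {
    "bc_time_offsets_ms": [-10, 0, 10, 100],
    "n_actions": 12,
    "n_train": 1000,
    "n_val": 200,
    "source_signature": "example-signature",
}
print(check_bc_cache((1000, 4), example_meta, [-10, 0, 10, 100]))
```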

How to run:

# After vis v2 is trained (encoder at output/ptretrain/vis/v2/encoder.pt):
.\.venv\Scripts\python.exe scripts/pretrain_bc.py --config config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset.yaml

Outputs: Run directory output/ptretrain/bc/v2_multi_offset/ with encoder.pt, pretrain_meta.json, csv/metrics.csv (or csv/version_0/metrics.csv), and TensorBoard logs. Meta includes bc_time_offsets_ms and bc_offset_weights. CSV columns include train_acc_offset_ms_-10, val_acc_offset_ms_0, val_acc_offset_ms_10, val_acc_offset_ms_100 (and train variants).

How to analyze (doc/exp):

.\.venv\Scripts\python.exe scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2_multi_offset output/ptretrain/bc/v2 output/ptretrain/bc/v2_next_tick --interval 5

Use documented values for v2 and v2_next_tick: the directory output/ptretrain/bc/v2_next_tick/ was overwritten by the n_stack=3 run (see “Experiment: n_stack 1 vs 3”), so its logs are for stack3, not the original next_tick. For v2_next_tick (n_stack=1) use the numbers from the “Experiment: BC target current_tick vs next_tick” and “Experiment: n_stack 1 vs 3” subsections below. For v2 (current_tick) use “Run Analysis (v2 chain)” and “Experiment: BC target current_tick vs next_tick”.

The script prints Per-offset validation accuracy for v2_multi_offset (last epoch and best epoch by val_loss). Compare val_acc_offset_ms_0 to v2 (both are offset 0 / current tick); compare val_acc_offset_ms_10 to v2_next_tick (both are next tick, +10 ms).

Semantics of baselines:

v2 (config pretrain_config_bc_v2.yaml): bc_target: current_tick → predicts action at the same tick as the image = offset 0 (MDP-aligned π(a_t|s_t)). Documented: val_acc 0.616 (last epoch), best epoch 38.
v2_next_tick (config pretrain_config_bc_v2_next_tick.yaml): bc_target: next_tick → predicts action at the next tick = offset +10 ms (π(a_{t+1}|s_t)). Documented (n_stack=1, before overwrite): val_acc 0.552, best epoch 18. Do not use metrics from the current v2_next_tick directory (they are for stack3).

Results and interpretation (v2_multi_offset vs v2 and v2_next_tick, 50 epochs; baseline values from documentation):

Main prediction (offset 0, current tick): val_acc_offset_ms_0 in v2_multi_offset is the accuracy of the head that predicts the action at the current frame (same as v2). In the run with action-timeline labels (cache rebuilt after switching to manifest "actions" + step_ms), v2_multi_offset achieved val_acc_offset_ms_0 = 0.620 (last epoch) vs v2 (current_tick, single head) val_acc 0.616 (doc / same run). Multi-offset training gives comparable or slightly better main (0-offset) prediction than the single-head current_tick baseline; the shared backbone benefits from the multi-task signal.
Forward offset +10 ms (next tick): val_acc_offset_ms_10 corresponds to the same target as v2_next_tick. v2_multi_offset val_acc_offset_ms_10 = 0.618 vs v2_next_tick documented val_acc 0.552 (n_stack=1). So the multi-offset head at +10 ms is much higher than the single-head next_tick model; multi-task training helps the +10 ms head.
Difference between offsets (with action timeline): With labels from the action timeline (not closest frame), per-offset accuracies at the last epoch differ as expected: val_acc_offset_ms_-10 = 0.623, val_acc_offset_ms_0 = 0.620, val_acc_offset_ms_10 = 0.618, val_acc_offset_ms_100 = 0.590. Past and current are slightly easier than +10 ms; +100 ms is clearly harder (~3 pp lower), so predicting action 100 ms ahead is more difficult. This confirms that 0 vs +10 ms targets are now distinct and the metric reflects real offset difficulty.

Action timings (not frame timings): Multi-offset labels use the action timeline from the manifest (manifest.json "actions" + metadata.json step_ms) when present: action at T+d ms = actions[round((T+d)/step_ms)]. So offset 0 and +10 ms refer to actual game-step actions (what was pressed at that time), not “closest captured frame”. That gives distinct targets for 0 vs +10 ms regardless of capture FPS. If a replay has no "actions" or no step_ms, the code falls back to “closest entry by time_ms” (frame-based), which at low FPS can make 0 and +10 identical.
Loss note: v2_multi_offset has higher reported val_loss (e.g. ~1.93) than v2 (~0.95) because the loss is a weighted sum of four cross-entropies (one per offset). So val_loss is not directly comparable; use val_acc and val_acc_offset_ms_* for comparison.
Takeaway: Multi-offset BC with action-timeline labels yields distinct per-offset accuracies: -10/0/10 ms similar (~0.62), +100 ms lower (~0.59). The 0-offset head is on par with or slightly better than single-head v2 (current_tick). The +10 ms head is much better than single-head v2_next_tick. When comparing to baselines, use documented values for v2 and v2_next_tick, not the current v2_next_tick directory (overwritten by stack3).
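The difference between action-timeline labels and the frame-based fallback can be illustrated with a short sketch. Only the index formula actions[round((T+d)/step_ms)] comes from the text; the function names, the out-of-range handling (returning None), and the toy data are assumptions.

```python
def timeline_label(actions, step_ms, frame_time_ms, offset_ms):
    """Label from the action timeline: actions[round((T + d) / step_ms)].

    Python's round() rounds exact halves to the nearest even integer.
    Returning None when the index is out of range is an assumption;
    the real code may handle the boundary differently.
    """
    index = round((frame_time_ms + offset_ms) / step_ms)
    if 0 <= index < len(actions):
        return actions[index]
    return None


def closest_frame_label(frames, frame_time_ms, offset_ms):
    """Fallback: label of the captured frame closest in time to T + d.

    `frames` is a list of (time_ms, action) pairs.
    """
    target_ms = frame_time_ms + offset_ms
    _, action = min(frames, key=lambda frame: abs(frame[0] - target_ms))
    return action


offsets_ms = [-10, 0, 10, 100]

# Toy timeline with one action per 10 ms game step.
actions = [0, 0, 1, 3, 3, 3]
timeline_labels = [timeline_label(actions, 10, 20, d) for d in offsets_ms]

# Toy capture at 20 FPS: frames are 50 ms apart.
frames = [(0, 0), (50, 1), (100, 3)]
fallback_labels = [closest_frame_label(frames, 50, d) for d in (0, 10)]

print(timeline_labels)
print(fallback_labels)
```

In the toy timeline, offsets 0 and +10 ms map to different game steps and get different labels. In the 20 FPS fallback, both targets (50 ms and 60 ms) are closest to the same captured frame, so the two heads would receive identical labels.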

Interpretation (generic): Compare val_acc_offset_ms_0 to v2 (current_tick, single head) and val_acc_offset_ms_10 to v2_next_tick (next_tick, single head). If val_acc_offset_ms_0 is close to or higher than v2’s val_acc, the shared backbone benefits from multi-task learning. If val_acc_offset_ms_10 or val_acc_offset_ms_100 are high, the model learns to anticipate; if they are lower than offset 0, current-tick prediction remains the main signal.
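The loss note above can be checked with simple arithmetic. The sketch assumes the reported val_loss is an unnormalized weighted sum of the per-offset cross-entropies with the configured bc_offset_weights; the per-head loss values are hypothetical, chosen only to show the scale effect.

```python
OFFSETS_MS = [-10, 0, 10, 100]
OFFSET_WEIGHTS = [0.2, 1.0, 0.5, 0.3]  # bc_offset_weights from the config


def weighted_sum_loss(per_head_ce, weights):
    """Total loss as sum_k w_k * CE_k (no normalization by sum of weights)."""
    return sum(w * ce for w, ce in zip(weights, per_head_ce))


def loss_per_unit_weight(total_loss, weights):
    """Rescale a weighted-sum loss to a rough per-head scale."""
    return total_loss / sum(weights)


# Hypothetical case: every head has the cross-entropy of single-head v2 (~0.95).
hypothetical_ce = [0.95, 0.95, 0.95, 0.95]
total = weighted_sum_loss(hypothetical_ce, OFFSET_WEIGHTS)
rescaled = loss_per_unit_weight(1.93, OFFSET_WEIGHTS)

print(f"weighted sum: {total:.3f}")
print(f"1.93 rescaled by sum of weights: {rescaled:.3f}")
```

Because the weights sum to 2.0, four heads at about 0.95 each would already report a total near 1.9. This is consistent with the ~1.93 reported for v2_multi_offset and is why val_acc, not val_loss, should be used to compare against single-head runs.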

Experiment: vis backbone + a_head only (v2_multi_offset_ahead vs v2_multi_offset)

Goal: Compare v2_multi_offset (four Linear heads) with v2_multi_offset_ahead (same data and offsets, but use_actions_head: true: four MLP heads in IQN A_head layout). The latter trains only vis backbone + action heads (no full IQN, no float head) and saves actions_head.pt for direct injection into RL IQN.

Configs:

v2_multi_offset: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset.yaml — run_name: v2_multi_offset, bc_time_offsets_ms: [-10, 0, 10, 100], bc_offset_weights: [0.2, 1.0, 0.5, 0.3], no use_actions_head (plain Linear heads).
v2_multi_offset_ahead: config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead.yaml — same as above + use_actions_head: true, save_actions_head: true, dense_hidden_dimension: 1024; run_name: v2_multi_offset_ahead.

How to run (ahead):

.\.venv\Scripts\python.exe scripts/pretrain_bc.py --config config_files/pretrain/bc/pretrain_config_bc_v2_multi_offset_ahead.yaml

Outputs (v2_multi_offset_ahead): output/ptretrain/bc/v2_multi_offset_ahead/ — encoder.pt, actions_head.pt (offset 0, IQN A_head layout), pretrain_meta.json, metrics.csv, TensorBoard. Same per-offset CSV columns as v2_multi_offset.

Run summary (50 epochs, from analyze_pretrain_bc.py):

| Metric | v2_multi_offset (Linear) | v2_multi_offset_ahead (MLP A_head) | v2_multi_offset_ahead_tuned |
|--------|--------------------------|------------------------------------|-----------------------------|
| train_acc (last epoch) | 0.693 | 0.792 | 0.678 |
| val_acc (last epoch) | 0.612 | 0.552 | 0.591 |
| train_loss (last epoch) | 1.574 | 1.049 | 1.647 |
| val_loss (last epoch) | 1.932 | 2.884 | 2.080 |
| Best epoch by val_loss | 49 | 14 | 16 |
| val_loss at best epoch | 1.932 | 2.004 | 2.029 |
| val_acc at best epoch | 0.612 | 0.594 | 0.591 |
| main_actions_val_acc at best | 0.515 | 0.457 | 0.420 |
| Epochs run | 50 | 50 | 27 (early stop) |

Per-offset validation accuracy (last epoch):

| Offset (ms) | v2_multi_offset | v2_multi_offset_ahead | v2_multi_offset_ahead_tuned |
|-------------|-----------------|-----------------------|-----------------------------|
| -10 | 0.6227 | 0.5610 | 0.6007 |
| 0 | 0.6200 | 0.5618 | 0.5941 |
| 10 | 0.6176 | 0.5497 | 0.5957 |
| 100 | 0.5896 | 0.5364 | 0.5753 |

v2_multi_offset_ahead at best epoch (14): val_acc_offset_ms_-10 = 0.600, val_acc_offset_ms_0 = 0.606, val_acc_offset_ms_10 = 0.595, val_acc_offset_ms_100 = 0.576. So even at its best, the A_head run stays roughly 1.4–2.3 pp below Linear (last epoch, which is also its best epoch) on each offset.
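The per-offset gap can be recomputed from the numbers quoted in this section: Linear at epoch 49 (its best epoch by val_loss) against A_head at its best epoch 14. The dictionary names are illustrative.

```python
# Per-offset validation accuracy, keyed by offset in ms.
linear_last_epoch = {-10: 0.6227, 0: 0.6200, 10: 0.6176, 100: 0.5896}
ahead_best_epoch = {-10: 0.600, 0: 0.606, 10: 0.595, 100: 0.576}

# Gap in percentage points (pp) for each offset.
gap_pp = {
    offset: (linear_last_epoch[offset] - ahead_best_epoch[offset]) * 100
    for offset in linear_last_epoch
}

for offset, gap in gap_pp.items():
    print(f"offset {offset:>4} ms: Linear ahead by {gap:.2f} pp")
```

The gaps come out at about 2.27, 1.40, 2.26, and 1.36 pp for -10, 0, +10, and +100 ms respectively.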

Tuned hyperparameters (v2_multi_offset_ahead_tuned): Config pretrain_config_bc_v2_multi_offset_ahead_tuned.yaml uses lr: 0.0002 (lower than 0.0005), weight_decay: 0.01 (AdamW), and early_stopping: true with patience: 10. Training stops after 27 epochs (last epoch 26; best val_loss at epoch 16). Val_acc 0.591 vs 0.552 (untuned ahead); per-offset accuracies 0.60 / 0.59 / 0.60 / 0.58, much closer to Linear, and the late-epoch overfitting of the untuned run is avoided. Per-action at last epoch: accel 0.80, left+accel 0.55, right+accel 0.49, coast 0.15; accel does not collapse (it is 0.50 in the untuned run), although coast is low. Suitable for RL merge. Use this config when you need A_head for RL and want better validation without manually picking a checkpoint.

Reg-only, no early stop (v2_multi_offset_ahead_reg): Config pretrain_config_bc_v2_multi_offset_ahead_reg.yaml uses lr: 0.0001, weight_decay: 0.02, early_stopping: false — full 50 epochs. Reaches val_acc 0.592 at epoch 49 (best val_loss at epoch 35 = 2.014). Per-offset at last epoch: 0.597 / 0.597 / 0.598 / 0.576 — best per-offset among A_head variants (before dropout). Train_acc 0.676 (no overfitting). Use when you want a full 50-epoch run without early stopping and comparable or slightly better val than tuned.

Dropout (v2_multi_offset_ahead_dropout): Config pretrain_config_bc_v2_multi_offset_ahead_dropout.yaml adds dropout: 0.2 on features before the action head (same lr/weight_decay as reg, no early stop). Best A_head run without inner dropout: val_acc 0.595, val_loss 1.989 at epoch 49; best val_loss 1.979 at epoch 40. Per-offset: 0.603 / 0.600 / 0.600 / 0.578, higher than the untuned, tuned, and reg variants and close to Linear (0.623/0.620/0.618/0.590). Recommended for RL merge when you want best val without early stopping.

Dropout + inner (v2_multi_offset_ahead_dropout_inner): Config pretrain_config_bc_v2_multi_offset_ahead_dropout_inner.yaml adds action_head_dropout: 0.1 between the two Linear layers of A_head (plus dropout 0.2 on features). Best A_head so far: val_acc 0.597, val_loss 1.971 at epoch 49; per-offset 0.605 / 0.603 / 0.602 / 0.579; main_actions_val_acc 0.476, coast 0.374. Saved actions_head.pt is remapped to IQN layout (no dropout in file). Recommended for RL merge.

Interpretation:

v2_multi_offset (Linear) does not overfit by epoch 49: best epoch = 49, val_acc 0.612, per-offset accuracies ~0.59–0.62. Higher capacity is not needed for this data.
v2_multi_offset_ahead (MLP A_head) overfits strongly: best epoch by val_loss = 14; afterwards val_loss rises (2.0 → 2.88) and val_acc falls (0.59 → 0.55) while train_acc keeps rising to 0.79. The MLP head has more parameters and fits the training set better but generalizes worse.
Per-action (epoch 49): Linear: accel 0.78, left+accel 0.59, right+accel 0.54, coast 0.43, left+accel+brake 0.42, right+accel+brake 0.33. A_head: accel 0.50, left+accel 0.76, right+accel 0.53, coast 0.41, left+accel+brake 0.31, right+accel+brake 0.27. A_head trades accel for left+accel and is worse on coast and brake-turn actions; overall val and main_actions are lower.

Recommendations:

Best validation quality (multi-offset): Use v2_multi_offset (Linear heads). It gives better val_acc and per-offset accuracy without overfitting.
RL merge (vis + A_head only): Use v2_multi_offset_ahead only if you need actions_head.pt for IQN injection. In that case enable early stopping (e.g. patience 5–10 on val_loss) or take the checkpoint at best epoch ~14; the default save at the end of 50 epochs is from an overfit model (val_acc 0.55, best was 0.59 at epoch 14).
RL merge with better val (recommended): Use v2_multi_offset_ahead_tuned (config pretrain_config_bc_v2_multi_offset_ahead_tuned.yaml): lower lr (0.0002), weight_decay (0.01), and early_stopping. Reaches val_acc 0.591 and per-offset 0.575–0.601, stops around epoch 26; per-action balance is sane (accel 0.80, no collapse). Saves actions_head.pt for RL merge.
Alternative without early stop: v2_multi_offset_ahead_reg (pretrain_config_bc_v2_multi_offset_ahead_reg.yaml): lr 0.0001, weight_decay 0.02, full 50 epochs; val_acc 0.592, best val_loss at epoch 35, per-offset 0.597 / 0.597 / 0.598 / 0.576.
Best A_head without inner dropout (no early stop): v2_multi_offset_ahead_dropout (pretrain_config_bc_v2_multi_offset_ahead_dropout.yaml): reg + dropout: 0.2; val_acc 0.595, val_loss 1.989, per-offset 0.603 / 0.600 / 0.600 / 0.578.
Best overall A_head: v2_multi_offset_ahead_dropout_inner (pretrain_config_bc_v2_multi_offset_ahead_dropout_inner.yaml): dropout 0.2 + action_head_dropout: 0.1; val_acc 0.597, val_loss 1.971, per-offset 0.605 / 0.603 / 0.602 / 0.579.
Reproduce: Run the analysis script to get full per-epoch and per-action tables:

.\.venv\Scripts\python.exe scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2_multi_offset output/ptretrain/bc/v2_multi_offset_ahead output/ptretrain/bc/v2_multi_offset_ahead_tuned output/ptretrain/bc/v2_multi_offset_ahead_reg --interval 5

Experiment: Full IQN-aligned chain (vis v2 + BC v2)

Overview: This experiment chain uses two config files so that both Level 0 (visual) and Level 1 (BC) use IQN-style normalization end-to-end.

Visual backbone v2 — config config_files/pretrain/vis/pretrain_config_vis_iqn.yaml: run_name: v2, image_normalization: "iqn". Output: output/ptretrain/vis/v2/encoder.pt.
BC v2 — config config_files/pretrain/bc/pretrain_config_bc_v2.yaml: run_name: v2, encoder_init_path: output/ptretrain/vis/v2/encoder.pt, image_normalization: "iqn". Output: output/ptretrain/bc/v2/.

Level 0 (visual) v2 results: Vis v2 was trained with IQN normalization and preprocess_cache_dir: cache/v0; 50 epochs (no early stopping). Final train_loss 0.206, val_loss 0.207. Encoder saved to output/ptretrain/vis/v2/encoder.pt.

Commands:

# 1. Train visual backbone with IQN normalization (output: output/ptretrain/vis/v2/)
python scripts/pretrain_visual_backbone.py --config config_files/pretrain/vis/pretrain_config_vis_iqn.yaml

# 2. Train BC with vis v2 encoder and IQN normalization (output: output/ptretrain/bc/v2/)
python scripts/pretrain_bc.py --config config_files/pretrain/bc/pretrain_config_bc_v2.yaml

Analysis (after BC v2 completes): Compare v1.2 (BC with IQN norm, backbone from vis/v1 without IQN) vs v2 (full alignment):

python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.2 output/ptretrain/bc/v2 --interval 5

Run Analysis (v2 chain)

vis v2: Level 0 with image_normalization: "iqn", preprocess_cache_dir: cache/v0; 50 epochs (no early stopping); train_loss 0.206, val_loss 0.207. CSV: output/ptretrain/vis/v2/csv/metrics.csv.
BC v2: Encoder from vis/v2, image_normalization: "iqn", cache v0; 50 epochs. Final val_loss 0.948, val_acc 0.616. CSV: output/ptretrain/bc/v2/csv/metrics.csv.
BC v2_next_tick: Same as BC v2 but bc_target: next_tick (config pretrain_config_bc_v2_next_tick.yaml). Original run (n_stack=1): final val_loss 1.116, val_acc 0.552 (this run was later overwritten by the stack3 experiment; metrics are in the “Experiment: n_stack 1 vs 3” subsection). Current dir contains the stack3 run (n_stack=3): val_loss 1.074, val_acc 0.552; see pretrain_meta.json for n_stack: 3.

Detailed Metrics Analysis (v1.2 vs v2, by epoch)

Methodology: Metrics are compared by epoch (same 50 epochs for BC). Source: Lightning CSV; use scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.2 output/ptretrain/bc/v2 --interval 5 to reproduce. Report train/val loss, val_acc at checkpoints; best epoch by val_loss (val_loss, val_acc, main_actions_val_acc); and per-action validation accuracy with action names.
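
The "best epoch by val_loss" selection described above can be sketched as follows. This is a minimal illustration, not the logic of scripts/analyze_pretrain_bc.py: it assumes the CSV has an epoch column and that train and validation metrics for one epoch may sit on separate rows with empty cells, which is the layout Lightning's CSV logger commonly produces. The function name and the sample rows are hypothetical.

```python
import csv
import io


def best_epoch_by_val_loss(csv_text):
    """Merge rows per epoch, then return the epoch row with the lowest val_loss."""
    per_epoch = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row.get("epoch"):
            continue  # skip rows that carry no epoch index
        epoch = int(row["epoch"])
        merged = per_epoch.setdefault(epoch, {"epoch": epoch})
        for key, value in row.items():
            # Empty cells mean "not logged on this row"; keep only real values.
            if key != "epoch" and value not in (None, ""):
                merged[key] = float(value)
    candidates = [r for r in per_epoch.values() if "val_loss" in r]
    return min(candidates, key=lambda r: r["val_loss"])


# Hypothetical three-epoch excerpt, NOT real run data.
SAMPLE = """epoch,step,train_loss,val_loss,val_acc
0,10,1.50,,
0,10,,1.40,0.50
1,20,1.20,,
1,20,,1.10,0.58
2,30,1.00,,
2,30,,1.15,0.59
"""

best = best_epoch_by_val_loss(SAMPLE)
print(best["epoch"], best["val_loss"], best["val_acc"])
```

In this toy excerpt the last epoch has the highest val_acc but not the lowest val_loss, which is why the write-up reports both the last epoch and the best epoch by val_loss.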

Results (v1.2 vs v2):

Summary: Both 50 epochs. v1.2: train_acc 0.706, val_acc 0.618, train_loss 0.763, val_loss 0.949. v2: train_acc 0.701, val_acc 0.616, train_loss 0.776, val_loss 0.948. Val_loss is virtually the same; v1.2 has slightly higher overall val_acc.
Best epoch by val_loss: Both best at epoch 38. v1.2: val_loss 0.9411, val_acc 0.6159, main_actions_val_acc 0.5169. v2: val_loss 0.9438, val_acc 0.6153, main_actions_val_acc 0.5072. So v1.2 has slightly better main_actions_val_acc at the best checkpoint (better for action prediction accuracy on the six main actions).
Per-action val_acc (last epoch, epoch 49):
- accel: v1.2 0.801, v2 0.797 — v1.2 slightly higher.
- left+accel: v1.2 0.564, v2 0.562 — tied.
- right+accel: v1.2 0.524, v2 0.522 — tied.
- coast: v1.2 0.451, v2 0.639 — v2 much better on coast.
- left+accel+brake: v1.2 0.419, v2 0.394 — v1.2 higher.
- right+accel+brake: v1.2 0.292, v2 0.301 — v2 slightly higher.
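
The per-action figures above can be turned into differences in percentage points with a few lines of Python. This is only arithmetic on the last-epoch numbers listed above; the unweighted mean is an assumption made for illustration and is not necessarily how the analysis script computes main_actions_val_acc, which the text reports at the best epoch rather than the last.

```python
ACTIONS = [
    "accel", "left+accel", "right+accel",
    "coast", "left+accel+brake", "right+accel+brake",
]
# Last-epoch (epoch 49) per-action val_acc, copied from the list above.
V1_2 = [0.801, 0.564, 0.524, 0.451, 0.419, 0.292]
V2 = [0.797, 0.562, 0.522, 0.639, 0.394, 0.301]


def unweighted_mean(values):
    """Plain average; assumes every action counts equally."""
    return sum(values) / len(values)


# Positive delta means v2 is higher than v1.2.
deltas_pp = {
    name: round((b - a) * 100, 1) for name, a, b in zip(ACTIONS, V1_2, V2)
}
for name, delta in deltas_pp.items():
    print(f"{name:>18}: {delta:+.1f} pp (v2 - v1.2)")

print(f"unweighted mean v1.2: {unweighted_mean(V1_2):.4f}")
print(f"unweighted mean v2:   {unweighted_mean(V2):.4f}")
```

Coast is the only action with a double-digit gap, so it dominates any unweighted last-epoch mean; these means should not be compared directly with the best-epoch main_actions_val_acc values quoted above.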

Conclusion for action prediction accuracy: v1.2 is marginally better for overall val_acc and main_actions_val_acc (mean over the six main actions) at the best epoch. v2 is clearly better for coast (0.64 vs 0.45). For best overall action prediction accuracy, prefer v1.2 (backbone from vis/v1 + IQN norm at BC). If coast prediction is a priority, v2 (full IQN-aligned chain) is better. Val_loss is effectively tied (0.948–0.949).

Experiment: BC target current_tick vs next_tick (v2 vs v2_next_tick)

Overview: This experiment compares two BC v2 runs that differ only in bc_target: v2 uses bc_target: current_tick (predict action at the same tick as the image), v2_next_tick uses bc_target: next_tick (predict action at the next tick). Same encoder (vis v2), same data, same hyperparameters (50 epochs, batch 4096, lr 0.0005). Configs: config_files/pretrain/bc/pretrain_config_bc_v2.yaml vs config_files/pretrain/bc/pretrain_config_bc_v2_next_tick.yaml.
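
To make the difference between the two targets concrete, here is a minimal sketch of how image/label pairs could be formed. The function make_bc_pairs is hypothetical and is not the project's dataset code; in particular, it assumes "next tick" means the next captured frame in the same replay, which the source does not spell out.

```python
def make_bc_pairs(frames, actions, bc_target="current_tick"):
    """Pair each frame with the action at the same index or at the next index."""
    if len(frames) != len(actions):
        raise ValueError("frames and actions must have the same length")
    if bc_target == "current_tick":
        offset = 0
    elif bc_target == "next_tick":
        offset = 1
    else:
        raise ValueError(f"unknown bc_target: {bc_target!r}")
    # With offset 1 the last frame has no label and is dropped.
    return [(frames[i], actions[i + offset]) for i in range(len(frames) - offset)]


# Placeholder data: strings stand in for images and action names.
frames = ["f0", "f1", "f2"]
actions = ["accel", "left+accel", "coast"]

print(make_bc_pairs(frames, actions, "current_tick"))
print(make_bc_pairs(frames, actions, "next_tick"))
```

Under this pairing the next_tick variant sees the same images but a shifted label sequence, so the label distribution and the difficulty of the task can both change even though the data are otherwise identical.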

How action prediction accuracy changed (next_tick vs current_tick):

Overall validation accuracy (last epoch): v2 (current_tick) 0.616 vs v2_next_tick 0.552 — next_tick is ~6.4 pp lower.
Best epoch by val_loss: v2 best at epoch 38 (val_loss 0.944, main_actions_val_acc 0.507); v2_next_tick best at epoch 18 (val_loss 1.091, main_actions_val_acc 0.315). So next_tick has much lower per-action accuracy at the best checkpoint (~19 pp lower main_actions_val_acc).
Per-action validation accuracy (epoch 49):
- accel: v2 0.797, v2_next_tick 0.702 — next_tick worse.
- left+accel: v2 0.562, v2_next_tick 0.507 — next_tick worse.
- right+accel: v2 0.522, v2_next_tick 0.504 — similar.
- coast: v2 0.639, v2_next_tick 0.000 — next_tick has no coast accuracy (likely no or very few coast labels in the next-tick target distribution).
- left+accel+brake: v2 0.394, v2_next_tick 0.196 — next_tick much worse.
- right+accel+brake: v2 0.301, v2_next_tick 0.118 — next_tick much worse.

Conclusion: bc_target: next_tick substantially reduces action prediction accuracy compared to current_tick: lower overall val_acc, much lower main_actions_val_acc, and coast drops to 0 (next-tick targets may rarely be coast, or the task is harder). For best action prediction accuracy, use bc_target: current_tick (v2). Use next_tick only if the downstream use case explicitly requires predicting the next tick’s action.

Reproduce:

python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2 output/ptretrain/bc/v2_next_tick --interval 5

Experiment: n_stack 1 vs 3 (temporal stack, next_tick)

Overview: This experiment compares n_stack=1 (single frame per sample) vs n_stack=3 (three consecutive frames per sample) for BC pretrain with bc_target: next_tick. Goal: determine whether feeding multiple past images into the model improves action prediction accuracy enough to justify the extra memory and compute.

Time between frames: Replay frames are captured by scripts/capture_replays_tmnf.py. The interval between consecutive frames is 1000 / fps ms (--fps is frames per simulation second; see TMNF replay download and frame capture). Typical values: 10 FPS → 100 ms between frames; 64 FPS → ~15.6 ms. With n_stack=3, the three frames span (n_stack − 1) × interval in time: e.g. at 10 FPS the stack covers 2×100 = 200 ms of simulation time; at 64 FPS about 31 ms. The game engine step is 10 ms (ms_per_tm_engine_step in config); capture can be every step (100 FPS) or sub-sampled (e.g. 10 or 64 FPS). Check metadata.json in each replay dir for fps and step_ms to see the actual interval used for your data.
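
The interval arithmetic above is easy to check in code. The two helper names below are illustrative and do not come from the capture script; the formulas are the ones stated in the paragraph (1000 / fps, and (n_stack − 1) × interval).

```python
def frame_interval_ms(fps):
    """Milliseconds of simulation time between consecutive captured frames."""
    return 1000.0 / fps


def stack_span_ms(fps, n_stack):
    """Simulation time covered from the first to the last frame of a stack."""
    return (n_stack - 1) * frame_interval_ms(fps)


for fps in (10, 64, 100):
    print(
        f"{fps:>3} FPS: interval {frame_interval_ms(fps):.3f} ms, "
        f"n_stack=3 span {stack_span_ms(fps, 3):.2f} ms"
    )
```

With n_stack=1 the span is zero, which matches the single-frame baseline having no temporal context.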

Configs:

Baseline (n_stack=1): config_files/pretrain/bc/pretrain_config_bc_v2_next_tick.yaml — n_stack: 1, preprocess_cache_dir: cache/v0, cache_load_in_ram: true, workers: 4.
Experimental (n_stack=3): config_files/pretrain/bc/pretrain_config_bc_v2_next_tick_stack3.yaml — n_stack: 3, preprocess_cache_dir: cache/v0_stack3, cache_load_in_ram: false, workers: 1.

Configuration changes (n_stack=3 vs n_stack=1):

n_stack: 1 → 3 (three consecutive frames per sample; model gets temporal context).
preprocess_cache_dir: cache/v0 → cache/v0_stack3 (separate cache because cache format depends on n_stack).
cache_load_in_ram: true → false (see Memory and RAM below).
workers: 4 → 1 (reduced to avoid OOM when using n_stack=3).

All other training settings are the same (batch_size 4096, epochs 50, lr 0.0005, same encoder init, image_normalization "iqn", bc_target next_tick).
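
One way to confirm that only the four listed fields differ is to diff the two configs as dictionaries. The dictionaries below contain only the fields named in this section (a subset of the real YAML files), and diff_configs is a hypothetical helper rather than part of the project.

```python
# Subset of fields taken from this section; the real YAML files hold more keys.
BASELINE = {
    "n_stack": 1,
    "preprocess_cache_dir": "cache/v0",
    "cache_load_in_ram": True,
    "workers": 4,
    "batch_size": 4096,
    "epochs": 50,
    "lr": 0.0005,
    "image_normalization": "iqn",
    "bc_target": "next_tick",
}
STACK3 = {
    **BASELINE,
    "n_stack": 3,
    "preprocess_cache_dir": "cache/v0_stack3",
    "cache_load_in_ram": False,
    "workers": 1,
}


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for key, (old, new) in diff_configs(BASELINE, STACK3).items():
    print(f"{key}: {old} -> {new}")
```

In practice the two dictionaries would come from loading the YAML files; the same diff then flags any unintended change between runs.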

Memory and RAM

Using n_stack=3 significantly increases RAM usage:

Cache: Each sample is 3× larger (3 frames instead of 1). Preprocessed cache size grows by a factor of ~3. Loading the full cache into RAM (cache_load_in_ram: true) with n_stack=3 can cause out-of-memory errors on machines with limited RAM.
Training batch: Each batch holds 3× more pixel data per sample (e.g. batch_size 4096 × 3 × 1 × H × W). This increases GPU and host memory during training.
Mitigation in the stack3 config: cache_load_in_ram: false keeps the cache on disk (memory-mapped) instead of loading it entirely into RAM; workers: 1 reduces the number of DataLoader workers to lower peak RAM. If you still hit OOM, consider reducing batch_size or building the cache with a smaller dataset.
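
The 3× factor for the training batch can be illustrated with a rough size estimate. The image height and width (64 × 64) and the float32 element size below are placeholders chosen for the example, since this section does not state them; only the batch size and the n_stack values come from the text.

```python
def batch_pixel_bytes(batch_size, n_stack, height, width, channels=1, bytes_per_value=4):
    """Bytes of raw pixel data in one batch shaped (B, n_stack, C, H, W)."""
    return batch_size * n_stack * channels * height * width * bytes_per_value


H, W = 64, 64  # hypothetical image size, not taken from the project config

single = batch_pixel_bytes(4096, n_stack=1, height=H, width=W)
stacked = batch_pixel_bytes(4096, n_stack=3, height=H, width=W)

print(f"n_stack=1: {single / 2**20:.0f} MiB per batch")
print(f"n_stack=3: {stacked / 2**20:.0f} MiB per batch")
print(f"ratio: {stacked / single:.1f}x")
```

The ratio is 3 regardless of the image size chosen; the absolute numbers depend on the real H, W, and dtype, and each DataLoader worker that prefetches batches holds its own copies, which is why fewer workers lowers peak RAM.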

Run analysis

v2_next_tick (n_stack=1) — original run, overwritten: Config pretrain_config_bc_v2_next_tick.yaml. Metrics below are from documentation only (recorded before the run was overwritten). Final val_loss 1.116, val_acc 0.552; best epoch by val_loss = 18 (val_loss 1.091, main_actions_val_acc 0.315). Per-action (epoch 49): accel 0.702, left+accel 0.507, right+accel 0.504, coast 0.000, left+accel+brake 0.196, right+accel+brake 0.118.
Current output/ptretrain/bc/v2_next_tick/ = v2_next_tick_stack3 (n_stack=3): The directory was overwritten by the stack3 run (config pretrain_config_bc_v2_next_tick_stack3.yaml). pretrain_meta.json in that dir shows n_stack: 3. Metrics: final val_loss 1.074, val_acc 0.552; best epoch by val_loss = 29 (val_loss 1.066, main_actions_val_acc 0.327). Per-action (epoch 49): accel 0.744, left+accel 0.504, right+accel 0.461, coast 0.000, left+accel+brake 0.173, right+accel+brake 0.113.

Comparison: documented v2_next_tick (n_stack=1) vs current run (n_stack=3)

| Metric | v2_next_tick (n_stack=1, from doc) | Current dir (n_stack=3) |
|--------|------------------------------------|-------------------------|
| Final val_loss | 1.116 | 1.074 (lower) |
| Final val_acc | 0.552 | 0.552 (tie) |
| Best epoch (by val_loss) | 18 | 29 |
| Best-epoch val_loss | 1.091 | 1.066 (lower) |
| Best-epoch main_actions_val_acc | 0.315 | 0.327 (slightly higher) |
| accel (last epoch) | 0.702 | 0.744 (higher) |
| left+accel (last epoch) | 0.507 | 0.504 (similar) |
| right+accel (last epoch) | 0.504 | 0.461 (lower) |
| left+accel+brake (last epoch) | 0.196 | 0.173 (lower) |
| right+accel+brake (last epoch) | 0.118 | 0.113 (similar) |

Conclusion: With n_stack=3 (current overwritten run), val_loss is better (1.074 vs 1.116 at last epoch; 1.066 vs 1.091 at best epoch) and main_actions_val_acc at best epoch is slightly higher (0.327 vs 0.315). accel accuracy at the last epoch is clearly higher (0.744 vs 0.702). Some actions are slightly worse with n_stack=3 (right+accel, left+accel+brake). Overall, temporal stack (n_stack=3) gives a small improvement in validation loss and best-epoch main-actions accuracy, and a clear gain on accel, at the cost of much higher RAM usage (see Memory and RAM above). The gain may not justify the memory cost unless accel or val_loss is critical.

Reproduce comparison: The original n_stack=1 run was overwritten, so side-by-side script comparison is not possible. To reproduce metrics for the current (stack3) run: python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2_next_tick --interval 5. The n_stack=1 numbers in the table above are from the documentation (recorded before overwrite).

Results and conclusions

Whether multiple past frames help: The comparison above (documented v2_next_tick n_stack=1 vs current run n_stack=3) shows that n_stack=3 gives a small improvement: lower val_loss (1.074 vs 1.116), better best-epoch val_loss and main_actions_val_acc, and higher accel accuracy. Some per-action accuracies are slightly worse (right+accel, left+accel+brake). Conclusion: Temporal context (3 frames) helps a bit on loss and accel, but the gain is modest and comes with much higher RAM usage (see Memory and RAM).
Trade-off: n_stack=3 provides temporal information but increases RAM usage; the stack3 config uses cache_load_in_ram: false and workers: 1 to avoid OOM. Whether to use n_stack=3 depends on whether the small accuracy/loss gain justifies the memory cost.

Recommendations

If you have limited RAM, use n_stack=1 or ensure cache_load_in_ram: false and lower workers (and optionally batch_size) for n_stack=3.
After running both experiments with distinct run names, use scripts/analyze_pretrain_bc.py to compare by epoch and decide whether the gain from n_stack=3 justifies the memory and compute cost.

Configuration Changes (v2 chain)

Visual pretrain (config_files/pretrain/vis/pretrain_config_vis_iqn.yaml): run_name: v2, image_normalization: "iqn", preprocess_cache_dir: cache/v0; other fields as in config_files/pretrain/vis/pretrain_config.yaml (early_stopping: false).
BC pretrain (config_files/pretrain/bc/pretrain_config_bc_v2.yaml): run_name: v2, encoder_init_path: output/ptretrain/vis/v2/encoder.pt, image_normalization: "iqn", preprocess_cache_dir: cache/v0; same training hyperparameters as main BC config.

Hardware (v2 chain)

Same as other pretrain runs (single GPU, Windows).

Conclusions (v2 chain)

Level 0 v2: Training the visual backbone with IQN normalization and cache v0 yields a valid encoder (val_loss 0.207 at 50 epochs). Ready for BC v2.
BC v2 vs v1.2 — action prediction accuracy: v1.2 and v2 are very close on val_loss (0.948–0.949) and overall val_acc (0.616–0.618). At best epoch by val_loss (epoch 38), v1.2 has higher main_actions_val_acc (0.517 vs 0.507), so v1.2 is marginally better for overall action prediction accuracy. v2 is clearly better for coast (0.64 vs 0.45 at epoch 49). Recommendation: For best aggregate action accuracy, use v1.2 (BC with IQN norm, backbone from vis/v1). For better coast prediction, use v2 (full IQN-aligned chain). Full alignment (vis+BC both with IQN norm) does not improve overall or main-actions accuracy over v1.2 in this comparison.

Recommendations (v2 chain)

For best action prediction accuracy (overall and main actions): use v1.2 (BC with image_normalization: "iqn" and encoder from vis/v1). For better coast prediction, v2 is preferable.
Reproduce comparison: python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.2 output/ptretrain/bc/v2 --interval 5.

Configuration Changes

BC pretrain (config_files/pretrain/bc/pretrain_config_bc.yaml):

v1: early_stopping enabled (or epochs capped at 25); lr as in config (e.g. 0.0005).
v1.1: early_stopping: false, lr: 0.0005, epochs: 50; image_normalization: default ("01" = [0,1]).
v1.2: Same as v1.1 except image_normalization: "iqn" (input (x - 0.5) / 0.5 on [0,1] cache).

Other settings shared: encoder_init_path: output/ptretrain/vis/v1/encoder.pt, bc_mode: backbone, n_actions: 12, batch_size: 4096, val_fraction: 0.1.
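
The two normalization forms mentioned in this document, (x - 0.5) / 0.5 on the [0,1] cache and (x - 128) / 128 on raw uint8 pixels in the RL pipeline, are close but not identical. The sketch below compares them over all 256 pixel values, assuming the cache stores pixel / 255; the function names are illustrative only.

```python
def iqn_norm_unit(x01):
    """IQN-style normalization applied to a value already scaled to [0, 1]."""
    return (x01 - 0.5) / 0.5


def iqn_norm_uint8(x255):
    """Normalization as described for the RL pipeline on raw uint8 pixels."""
    return (x255 - 128) / 128


# Assumption: the [0,1] cache holds pixel / 255.
gaps = [abs(iqn_norm_unit(p / 255) - iqn_norm_uint8(p)) for p in range(256)]
max_gap = max(gaps)

print(f"unit form range:  [{iqn_norm_unit(0.0)}, {iqn_norm_unit(1.0)}]")
print(f"uint8 form range: [{iqn_norm_uint8(0)}, {iqn_norm_uint8(255)}]")
print(f"largest difference over all pixel values: {max_gap:.7f}")
```

Under that assumption the largest gap is 1/128 (at pixel value 255), so the pretrain and RL inputs agree to within about 0.008 on a [-1, 1] scale.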

Full IQN-aligned chain (v2):

Visual pretrain (config_files/pretrain/vis/pretrain_config_vis_iqn.yaml): run_name: v2, image_normalization: "iqn". Same schema as config_files/pretrain/vis/pretrain_config.yaml.
BC pretrain (config_files/pretrain/bc/pretrain_config_bc_v2.yaml): run_name: v2, encoder_init_path: output/ptretrain/vis/v2/encoder.pt, image_normalization: "iqn".

Hardware

Same as other pretrain/RL runs (single GPU, Windows).

Conclusions

Training length: ~20–25 epochs is enough; val_loss and val_acc do not improve after that, and val_loss is higher at 50 epochs than at the 25-epoch stop (1.119 for v1.1 vs 1.038 for v1), which indicates overfitting.
Early stopping: Recommended. Use val_loss with patience 5–10 so runs stop around 20–25 epochs and avoid overfitting.
Image normalization (v1.1 vs v1.2): IQN-style normalization helps. With image_normalization: "iqn" (v1.2), validation loss is ~15% lower (0.949 vs 1.119 at 50 epochs) and best-epoch val_loss and main_actions_val_acc are better (0.941 and 0.52 vs 0.994 and 0.42). Overall val_acc at the last epoch is slightly lower (0.618 vs 0.625); the gain is in generalization (loss). Both runs use a backbone pretrained without IQN normalization; using IQN normalization at BC time still improves results. Use image_normalization: "iqn" for BC when the encoder is intended for IQN transfer.
Full IQN-aligned chain (v2): Compared with analyze_pretrain_bc.py v1.2 v2. For action prediction accuracy: v1.2 is marginally better (higher overall val_acc and main_actions_val_acc at best epoch 38). v2 is better for coast (0.64 vs 0.45). Val_loss is tied. Use v1.2 for best aggregate action accuracy; use v2 if coast prediction matters more. See “Experiment: Full IQN-aligned chain” subsection for full metrics.
Three-way (v1.1 vs v1.2 vs v2): v1.1 has highest overall val_acc at epoch 49 (0.625) but worst val_loss (1.12) and lowest main_actions_val_acc at best epoch (0.42). v1.2 has best main_actions_val_acc at best epoch (0.517) and best val_loss. v2 has best coast (0.64). So: for generalization and main-actions accuracy prefer v1.2; for raw end val_acc v1.1 is highest (with more overfitting); for coast prefer v2.
Per-action metrics: Report and compare per-action accuracy by name (accel, left+accel, right+accel, coast, left+accel+brake, right+accel+brake). Rare actions (left, right, brake, left+brake, right+brake, accel+brake) have ~0 val samples; their val_acc_class_* are 0 or negligible.
BC target (v2 vs v2_next_tick): bc_target: next_tick gives lower action prediction accuracy than current_tick: overall val_acc ~0.55 vs 0.62, main_actions_val_acc at best epoch 0.32 vs 0.51; coast drops to 0 with next_tick. Prefer current_tick for best accuracy unless the downstream task requires next-tick prediction.
n_stack (v2_next_tick vs v2_next_tick_stack3): n_stack=3 uses three consecutive frames per sample (temporal context) but significantly increases RAM usage (cache ~3× larger, batch 3× more pixel data). The stack3 config uses cache_load_in_ram: false and workers: 1 to avoid OOM. Comparison (documented v2_next_tick n_stack=1 vs current overwritten run = n_stack=3): n_stack=3 gives slightly better val_loss (1.074 vs 1.116), best-epoch main_actions_val_acc (0.327 vs 0.315), and accel accuracy (0.744 vs 0.702); some actions are slightly worse. The gain is modest; whether it justifies the memory cost is a trade-off.
Multi-offset Linear vs A_head (v2_multi_offset vs v2_multi_offset_ahead): v2_multi_offset (Linear heads) gives better validation (val_acc 0.612, per-offset ~0.59–0.62, best epoch 49). v2_multi_offset_ahead (MLP A_head) overfits: best epoch by val_loss = 14, then val_acc drops to 0.55 and val_loss rises to 2.88 by epoch 49. Use v2_multi_offset for best multi-offset accuracy; if using v2_multi_offset_ahead for RL merge, enable early stopping or use checkpoint at best epoch ~14 (see “Experiment: vis backbone + a_head only”). v2_multi_offset_ahead_tuned (lr 0.0002, weight_decay 0.01, early_stopping) reaches val_acc 0.591 and per-offset ~0.59–0.60, stops at ~epoch 26; recommended for RL merge when you need A_head.
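
The early-stopping rule recommended in these conclusions (monitor val_loss, patience 5–10) can be sketched in plain Python. This is a simplified stand-in for a framework callback, not the project's training code, and the val_loss curve below is made up for illustration.

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a patience-based rule on val_loss.

    Training stops at the first epoch where val_loss has failed to improve
    on the best value for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran to the end of the provided curve.
    return len(val_losses) - 1, best_epoch


# Hypothetical curve: improves for four epochs, then drifts upward.
curve = [1.30, 1.15, 1.08, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.10]

stop, best = early_stop_epoch(curve, patience=5)
print(f"best epoch {best}, stopped at epoch {stop}")
```

With patience 5 the toy run stops five epochs after its best epoch; a larger patience trades extra epochs for robustness against a noisy val_loss.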

Recommendations

Use early stopping on val_loss (patience 5–10) for BC pretrain.
Do not train beyond ~25 epochs unless you have evidence of underfitting (e.g. val_loss still decreasing).
Use image_normalization: "iqn" for BC pretrain when the encoder will be loaded into IQN; it reduces val_loss and improves best-epoch metrics even when the visual backbone was pretrained without IQN normalization.
Best for action prediction accuracy: Use v1.2 (BC with IQN norm, encoder from vis/v1) for best main_actions_val_acc at best epoch and best val_loss; use v2 if coast prediction is the priority. v1.1 has the highest overall val_acc at epoch 49 (0.625) but the worst val_loss (overfitting).
n_stack: Use n_stack=1 unless you have evidence that n_stack=3 improves accuracy enough to justify the higher RAM usage (and possibly cache_load_in_ram: false, fewer workers). Run both with distinct run names and compare with analyze_pretrain_bc.py.
Multi-offset (v2_multi_offset vs v2_multi_offset_ahead): For best multi-offset validation accuracy use v2_multi_offset (Linear heads). For RL merge of vis + A_head only, the options in increasing val_acc are: v2_multi_offset_ahead_tuned (lr 0.0002, weight_decay 0.01, early_stopping; val_acc 0.591, stops ~epoch 26); v2_multi_offset_ahead_reg (lr 0.0001, weight_decay 0.02, no early stop, 50 epochs; val_acc 0.592, best epoch 35); v2_multi_offset_ahead_dropout (reg + dropout 0.2; val_acc 0.595, val_loss 1.989, best A_head (no early stop)); and v2_multi_offset_ahead_dropout_inner (dropout 0.2 + action_head_dropout 0.1; val_acc 0.597, val_loss 1.971, best overall A_head). Untuned v2_multi_offset_ahead overfits; if used, enable early stopping or checkpoint at epoch ~14.
For analysis, use scripts/analyze_pretrain_bc.py with run dirs or --base-dir output/ptretrain/bc v1 v1.1 (or v1.1 v1.2 for normalization comparison) to get tables by epoch, combined analysis (best epoch by val_loss + main_actions_val_acc), and per-action accuracy with human-readable names (accel, left+accel, right+accel, etc.).
To generate the per-action accuracy plot: python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1 output/ptretrain/bc/v1.1 --plot --output-dir docs/source/_static (saves exp_pretrain_bc_per_action_accuracy.jpg).

Analysis tools

BC pretrain comparison: python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1 output/ptretrain/bc/v1.1 --interval 5
Normalization (v1.1 vs v1.2): python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.1 output/ptretrain/bc/v1.2 --interval 5
Full IQN-aligned chain (v1.2 vs v2): python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.2 output/ptretrain/bc/v2 --interval 5
BC target current_tick vs next_tick (v2 vs v2_next_tick): python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2 output/ptretrain/bc/v2_next_tick --interval 5
Multi-offset Linear vs A_head (incl. dropout): .\.venv\Scripts\python.exe scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2_multi_offset output/ptretrain/bc/v2_multi_offset_ahead output/ptretrain/bc/v2_multi_offset_ahead_tuned output/ptretrain/bc/v2_multi_offset_ahead_reg output/ptretrain/bc/v2_multi_offset_ahead_dropout output/ptretrain/bc/v2_multi_offset_ahead_dropout_inner --interval 5
n_stack 1 vs 3 (v2_next_tick vs v2_next_tick_stack3): Original v2_next_tick (n_stack=1) was overwritten; comparison is in the doc (documented n_stack=1 vs current dir = n_stack=3). To print metrics for the current run only: python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v2_next_tick --interval 5. If you re-run both with distinct run names, use the two-dir command to compare.
Three-way (v1.1 vs v1.2 vs v2): python scripts/analyze_pretrain_bc.py output/ptretrain/bc/v1.1 output/ptretrain/bc/v1.2 output/ptretrain/bc/v2 --interval 5
With explicit CSV: python scripts/analyze_pretrain_bc.py --csv output/ptretrain/bc/v1/csv/metrics.csv output/ptretrain/bc/v1.1/csv/metrics.csv
With base dir: python scripts/analyze_pretrain_bc.py --base-dir output/ptretrain/bc v1 v1.1
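
When several comparisons are run in sequence, the commands above can be assembled programmatically. The helper below is a hypothetical convenience wrapper: it only builds the argument list using the script path and the --interval flag shown in this section, and does not execute anything.

```python
import shlex

SCRIPT = "scripts/analyze_pretrain_bc.py"


def build_compare_cmd(run_names, base_dir="output/ptretrain/bc", interval=5, python="python"):
    """Build the argv list for comparing the given runs by epoch."""
    run_dirs = [f"{base_dir}/{name}" for name in run_names]
    return [python, SCRIPT, *run_dirs, "--interval", str(interval)]


cmd = build_compare_cmd(["v1.1", "v1.2", "v2"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root; on Windows the python argument can be set to the venv interpreter path used elsewhere in this document.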