BC pretraining
Experiment Overview
This experiment tested whether adding behavioral cloning (BC) pretraining on top of the visual backbone pretraining improves RL training. Two BC-based RL variants are compared:
A01_as20_long_vis_bc_pretrained — three stages: (1) Visual pretrain (vis v1), (2) BC pretrain with backbone only (single-frame action prediction, encoder saved), (3) RL with that encoder.
A01_as20_long_vis_bc_ah_pretrained — uses both the visual encoder (img head) and the action head (A_head) from the v2_multi_offset_ahead_dropout_inner BC experiment (see BC pretrain: training length & early stop (Level 1)). That BC run uses multi-offset action prediction, IQN-style A_head (MLP) with dropout (0.2 on features, 0.1 inside head), vis v2 backbone, and achieves best val_acc 0.597 / val_loss 1.971. RL injects encoder.pt and actions_head.pt from output/ptretrain/bc/v2_multi_offset_ahead_dropout_inner/.
Stages for the first variant:
Visual pretrain — same as A01_as20_long_vis_pretrained: autoencoder on replay frames (config_files/pretrain/vis/pretrain_config.yaml), producing output/ptretrain/vis/v1/encoder.pt.
BC pretrain — train a CNN to predict actions from images, initialized from the vis encoder (config_files/pretrain/bc/pretrain_config_bc.yaml). BC mode backbone: only the encoder is saved (output/ptretrain/bc/v1.1/encoder.pt) and injected into IQN. In this experiment BC was simply action prediction from a single frame (no temporal context).
RL training — same RL config as the other runs, with pretrain_encoder_path: "output/ptretrain/bc/v1.1/encoder.pt" so the BC-trained encoder is used as the IQN visual backbone.
Hypothesis: BC might provide a better initialization than visual-only pretraining by aligning the backbone with action-relevant features. Using both encoder and A_head (bc_ah) may give a stronger policy prior than encoder-only.
Main question: Does BC help over vis-only? Does adding the pretrained A_head (bc_ah) improve over encoder-only BC (bc_pretrained)?
Results
Important (time axis): Numeric minute checkpoints in the sections below are TensorBoard wall-clock minutes on merged logs (5-minute grid, --time-axis wall_minutes), as in the original write-up. Cumulative training hours for the main quartet (audited): baseline ~8.2 h, vis_pretrained ~4.6 h, bc_pretrained ~4.4 h, bc_ah ~4.1 h — wall span and training time are close (ratio ~1), so the old minute tables are still roughly comparable to training progress. Exceptions: full_iqn_bc and full_iqn_bc_2 have wall ≫ training (see Time axis conventions (experiment write-ups)); narrative that uses “~1020 min” or “~770 min” there is calendar TB time, not hours the learner trained. Use BY STEP or cumul_training_hours for those. Comparing by “last value” across different run lengths is invalid. Common wall window for the first four runs: up to ~244 min (shortest = bc_ah).
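The wall-vs-training ratio used above can be checked directly (a minimal sketch; the run totals come from the audit in this write-up, and the helper name is ours):

```python
def wall_inflation(wall_minutes, training_hours):
    """Ratio of TB wall-clock span to cumulative learner training time.
    ~1 means wall-minute tables track training progress; >>1 means the
    logs contain long calendar gaps (use steps or cumul_training_hours)."""
    return wall_minutes / (training_hours * 60)

baseline_ratio = wall_inflation(486, 8.18)       # baseline: wall ≈ training
full_iqn_bc_2_ratio = wall_inflation(1018, 4.09) # full_iqn_bc_2: wall >> training
```

For the baseline the ratio is ~1, so its minute checkpoints roughly track training; for full_iqn_bc_2 it is ~4x, which is why its "minute" narratives must be reinterpreted.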
Key Findings:
BC does not improve final performance over vis-only (encoder-only BC). By the end of the common window (260 min for the first three runs), vis_pretrained has the best A01 time (24.47s), then bc_pretrained (24.49s), then baseline (24.53s).
BC + A_head (bc_ah) matches the best A01 time. At 240 min (end of bc_ah run), bc_ah reaches 24.47s — tied with vis_pretrained and better than bc_pretrained (24.49s) and baseline (24.53s). So the new experiment (encoder + A_head from v2_multi_offset_ahead_dropout_inner) is as good as vis-only on final best time and better than encoder-only BC.
BC gives the fastest early convergence (encoder-only): First eval finish at 5.1 min (bc) vs 8.3 min (baseline) vs 11.2 min (vis_pretrained). bc_ah has first eval finish at 15.2 min (slowest of the four) — pretrained A_head does not help early first finish.
By steps: At 10.85M steps (common max for first three), same ordering: vis_pretrained 24.47s, bc 24.49s, baseline 24.53s. bc_ah run is shorter; at equal relative time (244 min) bc_ah reaches 24.47s.
Eval finish rate at 240 min: vis_pretrained 74%, bc_pretrained 70%, bc_ah 68%, baseline 63%. bc_ah is between bc and baseline.
Training loss at 240 min: bc_pretrained 54.93 (lowest), bc_ah 62.58, baseline 63.89, vis_pretrained 64.33. bc_ah has lower loss than baseline and vis.
Conclusion: BC + A_head (bc_ah) is the only BC variant that matches vis-only on final A01 best time (24.47s) and beats encoder-only BC (24.49s). It does not improve over vis on finish rate or early first finish; it is a viable alternative when using the multi-offset A_head pretrain.
Does pretrain and transfer help? (Summary)
Question: With the setups tried so far, does pretrain (BC encoder + A_head) and transfer actually help, or is it not working?
Answer: Yes, pretrain helps when used correctly. Evidence: bc_ah (encoder + A_head trainable) reaches 24.47s — tied with vis-only. enc_ah_freeze (both frozen) fails (24.97s, 22% finish rate). enc_ah_freeze_resume (unfreeze + lower lr/epsilon) recovers to 24.51s from the poor enc_ah_freeze checkpoint. So pretrain works when allowed to fine-tune; freezing A_head hurts; unfreezing with gentle lr/epsilon recovers. It does not beat vis-only on final time, but is a viable path. For best final time, use bc_ah or vis-only; avoid freezing A_head.
Run Analysis
A01_as20_long (baseline): No pretrain. Cumulative training ~8.18 h; TB wall span ~486 min; 3 dirs merged (ratio ~1).
A01_as20_long_vis_pretrained: ~4.59 h training; wall ~270 min; 2 dirs.
A01_as20_long_vis_bc_pretrained: ~4.42 h training; wall ~260 min; 2 dirs.
A01_as20_long_vis_bc_ah_pretrained (bc_ah): ~4.09 h training; wall ~240 min; 2 dirs. Pretrain paths as before (v2_multi_offset_ahead_dropout_inner BC).
A01_as20_long_vis_bc_ah_pretrained_enc_freeze: ~2.67 h training; wall ~155 min; 2 dirs.
A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze: ~3.84 h training; wall ~225 min; 2 dirs.
A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze_resume: ~4.34 h training; wall ~255 min; 2 dirs.
A01_as20_long_full_iqn_bc: ~3.25 h cumulative training; TB wall span ~1310 min (wall ≫ training, ratio ~6.7×). The old “~192 min” line matched training time (~195 min), not the TB wall span. 3 TB dirs. See “Experiment: Full IQN from BC (full_iqn_bc)”.
A01_as20_long_full_iqn_bc_2: ~4.09 h training; wall ~1018 min (~4.2×). 2 dirs. Earlier text used “~1020 min” as if it were training length — wrong.
A01_as20_long_full_iqn_bc_3: ~19.45 h training; wall ~1162 min (ratio ~1.0 — wall span ≈ training time). 5 dirs.
A01_as20_long_full_iqn_bc_3_best_ref: Full IQN from BC (v3_multi_offset) with reference line from the best human-driven run (A01_0.5m_cl_best.npy) in map_cycle. ~265 min, 2 TensorBoard log dirs merged. See “Experiment: Full IQN from BC with human best reference line (best_ref)”.
A01_as20_long_full_iqn_bc_3_4explo: Same full IQN from BC (v3_multi_offset) with exploration repeat 4 instead of 64 in map_cycle (fewer exploration episodes per cycle). ~337 min, 2 TensorBoard log dirs merged. See “Experiment: Full IQN from BC with 4 exploration episodes (4explo)”.
Detailed TensorBoard Metrics Analysis
Methodology — Wall-minute tables vs training hours: Main quartet + freeze sections use the historical 5, 10, … wall-minute grid (--time-axis wall_minutes --interval 5). For full_iqn_bc / _2 / _3, reinterpret long “minute” narratives as TB wall time or switch to BY STEP / cumul_training_hours. Figures from generate_experiment_plots.py use default auto (training hours on X when logged). Step checkpoints: 50k, 100k, … as before.
A01 Map Performance (common window up to 244 min for all four runs)
Baseline (A01_as20_long): at 35 min — 25.02s; at 85 min — 24.71s; at 150 min — 24.59s; at 240 min — 24.53s. First eval finish ~8.3 min.
Vis pretrained (A01_as20_long_vis_pretrained): at 35 min — 24.79s; at 85 min — 24.55s; at 150 min — 24.50s; at 240 min — 24.47s. First eval finish ~11.2 min.
BC pretrained (A01_as20_long_vis_bc_pretrained): at 10 min — 24.92s (fastest early); at 35 min — 24.89s; at 85 min — 24.59s; at 150 min — 24.55s; at 240 min — 24.49s. First eval finish ~5.1 min (earliest).
BC + A_head (A01_as20_long_vis_bc_ah_pretrained): First eval finish ~15.2 min (latest of the four). At 30 min — 24.83s; at 65 min — 24.59s; at 100 min — 24.55s; at 140 min — 24.53s; at 190 min — 24.47s; at 240 min — 24.47s (tied with vis_pretrained for best). Eval finish rate at 240 min: 68%.
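The "compare only over the common window" rule applied in these checkpoints can be sketched as a small helper (toy data; the series shape and names are our assumptions, not the analysis script's API):

```python
def best_within_common_window(runs):
    """runs: {name: [(wall_min, best_time_s), ...]}. Clip every series at the
    shortest run's last checkpoint, then take each run's best (min) time there,
    since comparing 'last value' across unequal run lengths is invalid."""
    cutoff = min(max(t for t, _ in series) for series in runs.values())
    return cutoff, {name: min(v for t, v in series if t <= cutoff)
                    for name, series in runs.items()}

# Toy illustration: run "a" is shorter, so the window ends at its last point.
runs = {"a": [(60, 25.0), (155, 24.6)],
        "b": [(60, 24.9), (155, 24.7), (240, 24.5)]}
```

With this toy data, run "b"'s later 24.5s point is excluded; only values up to 155 min are compared.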
Training Loss
Baseline: at 90 min — 64.29; at 240 min — 63.89.
Vis pretrained: at 90 min — 61.35; at 240 min — 64.33.
BC pretrained: at 90 min — 71.70; at 240 min — 54.93 (lowest at 240 min).
BC + A_head (bc_ah): at 90 min — 54.06; at 240 min — 62.58. Lower than baseline and vis at 240 min.
Average Q-values
At 240 min: baseline -1.14, vis_pretrained -0.68, bc_pretrained -0.61, bc_ah -1.49. No clear winner; bc_ah is more negative (more conservative Q).
GPU Utilization
All ~69–72% over the common window; no significant difference.
Experiment: Encoder freeze (bc_ah vs enc_freeze)
Goal: Compare A01_as20_long_vis_bc_ah_pretrained (encoder + A_head from v2_multi_offset_ahead_dropout_inner, encoder trainable) with A01_as20_long_vis_bc_ah_pretrained_enc_freeze (same pretrain, img_head frozen during RL). Only difference: nn.vis.freeze: true in enc_freeze (legacy runs used pretrain_encoder_freeze under training, now removed).
Common window: Up to 158 min (shortest run = enc_freeze). All comparisons below are by relative time over this window.
A01 map performance:
bc_ah (encoder trainable): First eval finish ~15.2 min. At 60 min — 24.80s; at 100 min — 24.55s; at 155 min — 24.50s. Eval finish rate at 155 min — 66%. Run continued to 244 min and reached 24.47s.
enc_freeze (encoder frozen): First eval finish ~4.1 min (much earlier than bc_ah). At 60 min — 24.81s; at 100 min — 24.60s; at 155 min — 24.55s. Eval finish rate at 155 min — 64%. Run ended at ~158 min; best time in window 24.55s.
Conclusion (encoder freeze): Over the common window (158 min), bc_ah has slightly better best A01 time (24.50s vs 24.55s at 155 min) and slightly higher eval finish rate (66% vs 64%). enc_freeze has much earlier first eval finish (4.1 min vs 15.2 min). Training loss at 155 min: bc_ah 57.29, enc_freeze 63.33 (bc_ah lower). So freezing the encoder preserves a strong policy prior (very fast first finish) but does not match the trainable-encoder run on best time within 158 min; the trainable encoder (bc_ah) pulls ahead by ~50 ms and has lower loss. For best final A01 time in long runs, keep the encoder trainable; use enc_freeze if you want fastest early first finish or fewer parameters to train.
Training loss and avg Q (at 60, 100, 155 min):
Loss: bc_ah 74.89 / 60.07 / 57.29; enc_freeze 89.35 / 62.26 / 63.33. bc_ah lower at all checkpoints.
Avg Q: Both fluctuate; at 155 min bc_ah -0.66, enc_freeze -1.34.
Reproduce: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_vis_bc_ah_pretrained A01_as20_long_vis_bc_ah_pretrained_enc_freeze --interval 5 --step_interval 50000
Experiment: Encoder + A_head freeze (three-way: bc_ah vs enc_freeze vs enc_ah_freeze)
Goal: Compare all three freeze variants with the same BC+AH pretrain: (1) bc_ah — both encoder and A_head trainable; (2) enc_freeze — encoder frozen, A_head trainable; (3) enc_ah_freeze — both encoder and A_head frozen (only float_feature_extractor, iqn_fc, V_head trainable). In current configs this maps to nn.vis.freeze and nn.decoder.advantage.freeze (see RL parameter freeze in Configuration Guide).
Common window: Up to 158 min (shortest = enc_freeze). All three runs have data up to 155 min.
A01 map performance (at 155 min):
| Run | Encoder | A_head | Best time (155 min) | First eval finish | Eval finish rate (155 min) | Loss (155 min) |
|---|---|---|---|---|---|---|
| bc_ah | trainable | trainable | 24.50s | 15.2 min | 66% | 57.29 |
| enc_freeze | frozen | trainable | 24.55s | 4.1 min | 64% | 63.33 |
| enc_ah_freeze | frozen | frozen | 24.97s | 8.9 min | 22% | 279.12 |
Conclusions (what to freeze):
Freezing both encoder and A_head (enc_ah_freeze) severely hurts RL: best time 24.97s (~470 ms worse than bc_ah), finish rate 22% (vs 66%), loss ~5x higher. The frozen A_head blocks policy adaptation; float_feature_extractor, iqn_fc, and V_head alone cannot compensate. Do not freeze A_head.
Freezing only the encoder (enc_freeze) is acceptable: best time 24.55s (~50 ms worse than bc_ah), finish rate 64%, loss slightly higher. enc_freeze has much earlier first eval finish (4.1 min vs 15.2 min). Use when you want fastest early first finish or fewer trainable parameters.
Best final A01 time: Keep both encoder and A_head trainable (bc_ah). For long runs aiming at 24.47s, do not freeze either.
Summary: Freeze encoder only if you prioritize early first finish; never freeze A_head. For best final time, train both.
Reproduce: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_vis_bc_ah_pretrained A01_as20_long_vis_bc_ah_pretrained_enc_freeze A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze --interval 5 --step_interval 50000
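How the two freeze flags map to per-parameter trainability can be sketched in plain Python (the module-name prefixes img_head/A_head are illustrative; the real mapping lives in the training code):

```python
def trainable_mask(param_names, freeze_vis=False, freeze_advantage=False):
    """Emulate nn.vis.freeze / nn.decoder.advantage.freeze: return a
    {param_name: requires_grad} map for hypothetical module-name prefixes."""
    frozen_prefixes = []
    if freeze_vis:
        frozen_prefixes.append("img_head.")
    if freeze_advantage:
        frozen_prefixes.append("A_head.")
    return {name: not any(name.startswith(p) for p in frozen_prefixes)
            for name in param_names}

params = ["img_head.conv1.weight", "A_head.fc.weight",
          "iqn_fc.weight", "V_head.fc.weight"]
```

In the enc_ah_freeze configuration only iqn_fc, V_head (and float_feature_extractor) stay trainable, which is exactly the setting the table shows failing.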
Experiment: Resume from enc_ah_freeze (unfreeze + lower lr and epsilon)
Goal: Take enc_ah_freeze checkpoint (both encoder and A_head frozen during RL; best 24.97s, 22% finish rate), resume training with everything unfrozen and reduced lr/epsilon. Hypothesis: gentle fine-tuning of the pretrained parts will recover or improve performance.
Setup: Load weights from enc_ah_freeze; set nn.vis.freeze: false, nn.decoder.advantage.freeze: false; lr_schedule: 5e-5 (0–500k), 1e-4 (500k–3M), 5e-5 (3M+); epsilon: 0.1 at start (vs 1.0 in enc_ah_freeze), 0.5 at 300k, 0.03 by 3M.
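A minimal sketch of reading a value from such a [step, value] schedule (assuming linear interpolation between breakpoints and clamping at the ends; the trainer's actual interpolation may differ):

```python
def schedule_value(schedule, step):
    """schedule: [[step, value], ...] sorted by step; linear between points,
    clamped before the first and after the last breakpoint."""
    if step <= schedule[0][0]:
        return schedule[0][1]
    for (s0, v0), (s1, v1) in zip(schedule, schedule[1:]):
        if step <= s1:
            frac = (step - s0) / (s1 - s0)
            return v0 + frac * (v1 - v0)
    return schedule[-1][1]

# The resume run's lr_schedule from the setup above:
lr_schedule = [[0, 5e-5], [500_000, 1e-4], [3_000_000, 5e-5]]
```

For example, halfway through the first segment (step 250k) the interpolated LR is 7.5e-5.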
Common window: Up to 228 min (shortest = enc_ah_freeze). enc_ah_freeze_resume ran ~259 min.
A01 map performance (by relative time):
| Run | Best time (225 min) | First eval finish | Eval finish rate (225 min) | Loss (225 min) |
|---|---|---|---|---|
| enc_ah_freeze | 24.97s | 8.9 min | 29% | 229.71 |
| enc_ah_freeze_resume | 24.51s | 6.0 min | 56% | 73.39 |
By steps (at 9.65M steps, common for both): enc_ah_freeze best 24.97s; enc_ah_freeze_resume best 24.51s (460 ms better).
Conclusions:
Unfreezing + lower lr/epsilon recovers from enc_ah_freeze. enc_ah_freeze_resume reaches 24.51s vs 24.97s — ~460 ms improvement. Finish rate 56% vs 22%; loss ~73 vs ~230. The pretrained encoder and A_head, when allowed to fine-tune gently, quickly adapt and improve.
First eval finish: resume 6.0 min vs freeze 8.9 min — slightly earlier.
Recommendation: If you ran enc_ah_freeze and it plateaued poorly, resume with unfreeze + lower lr (5e-5–1e-4) and lower epsilon (0.1 at start) to recover.
Reproduce: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze_resume --interval 5 --step_interval 50000
Experiment: Full IQN from BC (full_iqn_bc)
Goal: Test loading the entire IQN (img_head, float_feature_extractor, iqn_fc, A_head, V_head) from a BC pretrain run that used use_full_iqn: true. The BC run v3_multi_offset (img 100 FPS, 11 offset heads, use_floats) produces iqn_bc.pt; RL loads it via pretrain_bc_heads_path: "output/ptretrain/bc/v3_multi_offset".
Config: Same RL base as other A01_as20_long runs; pretrain_bc_heads_path: "output/ptretrain/bc/v3_multi_offset" (no separate encoder or A_head paths — the full state is in iqn_bc.pt). BC source: config_files/pretrain/bc/pretrain_config_bc_v3_multi_offset.yaml.
Common window: Up to 192 min (full_iqn_bc is the shortest run). All five runs compared: baseline, vis_pretrained, bc_pretrained, bc_ah, full_iqn_bc.
Key findings
A01 best time at 190 min: full_iqn_bc 24.50s (tied with vis_pretrained); bc_ah 24.47s (best); bc_pretrained 24.51s; baseline 24.58s. Full IQN pretrain reaches same final performance as vis_pretrained.
Overfitting: No sign of overfitting. full_iqn_bc improves steadily: 25.84s (20 min) → 24.56s (60 min) → 24.50s (110 min) → 24.50s (190 min). Best time plateaus; run ended at 192 min before further training.
Training loss: full_iqn_bc starts with a very high loss (~15,749 at 5 min) due to distribution shift (BC classification vs RL regression). By 15 min it drops to ~416; by 85 min ~53 (lowest among runs at that checkpoint). Recovers quickly.
First eval finish: full_iqn_bc ~17.5 min (slower than bc_pretrained 5.1 min, similar to bc_ah 15.2 min).
Eval finish rate at 190 min: full_iqn_bc 56%, vis_pretrained 76%, bc_ah 67%, baseline 59%.
Benefit from full pretrain: Faster wall-clock to result — full_iqn_bc reaches 24.50s in 192 min (run ended), while vis_pretrained and bc_ah need 244–275 min. For the same A01 time, full_iqn_bc is ~25% faster in wall-clock. Final best time is tied with vis_pretrained and slightly worse than bc_ah (24.50s vs 24.47s).
Conclusion: Full IQN pretrain (v3_multi_offset) gives comparable A01 performance to vis_pretrained and faster time-to-result. No overfitting. bc_ah remains best for final time (24.47s); full_iqn_bc is a good option when you want fast wall-clock convergence with similar final performance. The high initial loss recovers within ~15 min.
Reproduce: python scripts/analyze_experiment_by_relative_time.py A01_as20_long A01_as20_long_vis_pretrained A01_as20_long_vis_bc_pretrained A01_as20_long_vis_bc_ah_pretrained A01_as20_long_full_iqn_bc --interval 5 --step_interval 50000
Runs full_iqn_bc_2 and full_iqn_bc_3 (separate runs)
A01_as20_long_full_iqn_bc, A01_as20_long_full_iqn_bc_2, and A01_as20_long_full_iqn_bc_3 are three separate runs (same config, run_name overridden to base, _2, _3). Each has its own save/ dir and its own TensorBoard log dir(s); within each run, tensorboard_suffix_schedule may split logs into _2, _3, … by step. Below we do not consider the first run (already documented); we analyze run _2 and run _3 in full.
Run A01_as20_long_full_iqn_bc_2 (~4.1 h cumulative training over ~1018 min TB wall span, 2 dirs — ~4× inflation)
Best A01: 24.510s early in training; then flat (no further best-time improvement).
Eval collapse (wall-clock axis): Mean ~25.29s, rate ~68% from ~50 min wall until ~770 min wall. At ~770 min wall (~**3.1 h** cumulative training if scaled linearly — use cumul_training_hours in TensorBoard for exact alignment) a collapse: mean jumps, rate drops, loss/Q spike. The “770 min” figure is not “770 minutes of training.”
By steps: Best 24.51s by ~8M steps — well before the wall-time collapse.
Conclusion (run _2): Instability coincides with suffix log / process boundary; interpret long-horizon behavior on steps or cumul_training_hours, not raw wall minutes.
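The "~770 min wall ≈ ~3.1 h training" figure above is a linear proportion (an approximation; exact alignment needs the cumul_training_hours scalar logged in TensorBoard):

```python
def wall_to_training_hours(wall_min, total_wall_min, total_training_hours):
    """Linearly rescale a wall-clock checkpoint to cumulative training hours.
    Approximation only: assumes training progressed uniformly over the wall span."""
    return wall_min / total_wall_min * total_training_hours

# Run _2 totals from this write-up: ~1018 min wall, ~4.09 h training.
collapse_hours = wall_to_training_hours(770, 1018, 4.09)
```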
Run A01_as20_long_full_iqn_bc_3 (~19.5 h training, ~1162 min wall, ratio ~1)
Best A01 (wall-minute narrative preserved): Same progression as before: large gains early, ~20 ms improvement from ~410 min wall to ~910 min wall in the long tail.
By steps: Dominant gains in the first ~15M steps; then plateau — unchanged conclusion.
Loss: No collapse like _2.
Why run _3 showed almost no improvement over long time
Plateau at 24.50s by ~20M steps; long tail adds ~20 ms best time — still valid on step axis.
LR schedule argument unchanged.
No collapse in _3 — unchanged.
Recommendations. (1) Early-stop on step or cumul_training_hours stability. (2) Run _2: treat collapse as wall ~770 min / ~3 h training region — investigate restart/suffix boundary. (3) Run _3: 24.48s ceiling conclusion unchanged.
Reproduce:
Run _2: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_2 --logdir tensorboard --interval 10 --step_interval 2000000
Run _3: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_3 --logdir tensorboard --interval 10 --step_interval 5000000
Experiment: Full IQN from BC with human best reference line (best_ref)
Goal: Test RL with the same full IQN pretrain (v3_multi_offset) but with the reference line taken from the best human-driven run on A01. The reference line (A01_0.5m_cl_best.npy) is set in map_cycle so that the environment uses the trajectory/path from the human best run (e.g. for checkpoints or track representation). Hypothesis: aligning the task with the human reference may improve learning or final time.
Setup: Same as full_iqn_bc (pretrain_bc_heads_path: "output/ptretrain/bc/v3_multi_offset"). Only difference: map_cycle.entries use reference_line_path: "A01_0.5m_cl_best.npy" for the A01 map (reference line from best human run). Run name: A01_as20_long_full_iqn_bc_3_best_ref.
Run duration: ~265 min (2 TensorBoard log dirs merged).
Results (by relative time)
Best A01 time: 24.57s at 250–260 min (alltime_min_ms; eval best 24.57s at 260 min). First eval finish at 62.6 min (much later than full_iqn_bc_3, which had first finish ~24 min).
Eval finish rate: 36% at 260 min (vs full_iqn_bc_3 ~69% at 1160 min; at similar wall time ~260 min, full_iqn_bc_3 had ~51% and best 24.58s).
Eval mean time (trained_A01): 201s at 260 min; improves from 294s (70 min) to 201s (260 min).
By steps: At 4M steps best 25.25s, rate 10%; at 8M steps best 24.70s, rate 31%.
Training loss: High at start (~16k), recovers to ~68 by 260 min. GPU learner % ~72–73%.
Comparison with full_iqn_bc_3 (no reference line): At ~260 min, best_ref reaches 24.57s and 36% finish rate. full_iqn_bc_3 at ~260 min had best 24.58s and ~51% rate. So best_ref is slightly better on best time (24.57s vs 24.58s) at similar wall time but lower finish rate (36% vs 51%) and later first finish (62.6 min vs ~24 min). The reference line does not clearly improve final performance in this run; the slower first finish suggests the different map/ref setup may change early exploration.
Conclusions
Reference line (human best) in map_cycle yields similar best time (24.57s) to full_iqn_bc_3 at 260 min (24.58s) but lower eval finish rate and later first eval finish. Run was shorter (265 min); longer run would be needed to see if best_ref can match or beat 24.48s.
Recommendation: If using the human best reference line, consider longer runs and compare finish rate and mean time; monitor whether first finish delay is due to env setup or exploration.
Reproduce: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_3_best_ref --logdir tensorboard --interval 10 --step_interval 2000000
Experiment: Full IQN from BC with 4 exploration episodes (4explo)
Goal: Test full IQN from BC (v3_multi_offset) with 4 exploration episodes per cycle instead of 64. In map_cycle, the exploration entry uses repeat: 4 so that each time the collector switches to the exploration map, it runs only 4 episodes before switching back (vs 64 in the default setup). Hypothesis: fewer exploration episodes per block may change the exploration/exploitation balance or data mix and affect convergence or final time.
Setup: Same as full_iqn_bc (pretrain_bc_heads_path: "output/ptretrain/bc/v3_multi_offset"). Only difference: map_cycle has the exploration entry with repeat: 4 (e.g. is_exploration: true, fill_buffer: true, repeat: 4) instead of repeat: 64. Run name: A01_as20_long_full_iqn_bc_3_4explo.
Run duration: ~337 min (2 TensorBoard log dirs merged).
Results (by relative time)
Best A01 time: 24.53s at 310–330 min (alltime_min_ms; eval best 24.53s at 330 min). First eval finish at 18.1 min.
Eval finish rate: 63% at 330 min. Eval mean time improves from ~298s (20 min) to ~128s (330 min).
By steps: At 2M steps best 24.96s, rate 10%; at 12M steps best 24.54s, rate 61%.
Training loss: High at start (~10k), recovers to ~62 by 330 min. GPU learner % ~69–70%.
Comparison with full_iqn_bc_3 (repeat 64): full_iqn_bc_3 at ~330 min had best 24.50s and eval rate ~57%; at 12M steps best 24.51s. So 4explo reaches 24.53s at 330 min and 24.54s at 12M steps — slightly worse than full_iqn_bc_3 (24.50s at similar time/steps) but in the same ballpark. First eval finish is earlier in 4explo (18.1 min vs ~24 min in _3). Eval finish rate at 330 min: 4explo 63% vs full_iqn_bc_3 ~57% at 260 min (different run lengths). So reducing exploration repeat from 64 to 4 does not improve best time and may slightly hurt it; finish rate is similar or slightly better in 4explo.
Conclusions
4 exploration episodes per cycle (repeat: 4) yields best 24.53s in ~337 min — comparable to but slightly worse than full_iqn_bc_3 with repeat 64 (24.50s at similar wall time). No clear benefit from fewer exploration episodes in this run.
Recommendation: Keep default exploration repeat (64) for full_iqn_bc unless further ablation shows a benefit from 4.
Reproduce: python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_3_4explo --logdir tensorboard --interval 10 --step_interval 2000000
Configuration Changes
RL training (encoder and optional A_head source):
```yaml
# Baseline
pretrain_encoder_path: null

# Vis pretrained
pretrain_encoder_path: "output/ptretrain/vis/v1/encoder.pt"

# BC pretrained (vis -> BC -> RL, encoder only)
pretrain_encoder_path: "output/ptretrain/bc/v1.1/encoder.pt"

# BC + A_head (encoder and action head from v2_multi_offset_ahead_dropout_inner)
pretrain_encoder_path: "output/ptretrain/bc/v2_multi_offset_ahead_dropout_inner/encoder.pt"
pretrain_actions_head_path: "output/ptretrain/bc/v2_multi_offset_ahead_dropout_inner/actions_head.pt"

# BC + A_head with encoder freeze (enc_freeze run; only difference: freeze img_head during RL)
# pretrain_encoder_path and pretrain_actions_head_path same as above
nn:
  vis:
    freeze: true

# BC + A_head with encoder AND A_head freeze (enc_ah_freeze run)
nn:
  vis:
    freeze: true
  decoder:
    advantage:
      freeze: true

# enc_ah_freeze_resume: unfreeze and use lower lr/epsilon
nn:
  vis:
    freeze: false
  decoder:
    advantage:
      freeze: false
lr_schedule: [[0, 0.00005], [500000, 0.0001], [3000000, 0.00005], ...]
# epsilon_schedule: [[0, 0.1], [50000, 0.1], [300000, 0.5], [3000000, 0.03]]

# Full IQN from BC (full_iqn_bc run)
pretrain_bc_heads_path: "output/ptretrain/bc/v3_multi_offset"

# Full IQN from BC with human best reference line (best_ref run)
# Same as above; in addition, map_cycle uses reference_line_path from best human run:
# map_cycle.entries: reference_line_path: "A01_0.5m_cl_best.npy" for A01 map.

# Full IQN from BC with 4 exploration episodes (4explo run):
# map_cycle exploration entry repeat: 4 instead of 64.
```
Visual pretrain (config_files/pretrain/vis/pretrain_config.yaml): task ae, image_size 64, n_stack 1, epochs 50, batch_size 4096, output_dir output/ptretrain/vis, run_name v1.
BC pretrain (config_files/pretrain/bc/pretrain_config_bc.yaml):
```yaml
encoder_init_path: "output/ptretrain/vis/v1/encoder.pt"
bc_mode: backbone
n_actions: 12
image_size: 64
n_stack: 1
epochs: 50
batch_size: 4096
output_dir: output/ptretrain/bc
run_name: v1.1
```
Hardware
GPU: Same as other A01 runs.
Parallel instances: Same gpu_collectors_count (from config_default).
System: Windows.
Conclusions
Does BC help in the current variant? No for encoder-only BC: vis_pretrained still achieves the best A01 time (24.47s) and highest eval finish rate (73%) by 260 min among the first three runs.
BC + A_head (bc_ah) matches the best. The run A01_as20_long_vis_bc_ah_pretrained (encoder + A_head from v2_multi_offset_ahead_dropout_inner) reaches 24.47s at 240 min — tied with vis_pretrained and better than encoder-only bc_pretrained (24.49s) and baseline (24.53s). So the new experiment is as good as the previous best (vis-only) on final A01 time. bc_ah has slower first eval finish (15.2 min vs 5.1–11.2 min) and lower finish rate at 240 min (68% vs 74% vis); it is a viable alternative when using the multi-offset A_head pretrain.
BC advantage (encoder-only): Only in the very early phase: earliest first finish (5.1 min), good initial time (24.92s by 10 min). So BC gives a faster “cold start” but vis-only catches up and slightly outperforms by 260 min.
Encoder freeze (bc_ah vs enc_freeze): Over the common window (158 min), bc_ah (encoder trainable) has better best A01 time (24.50s vs 24.55s at 155 min) and lower training loss; enc_freeze has much earlier first eval finish (4.1 min vs 15.2 min). So freezing the encoder keeps a strong policy prior for fast first finish but does not match the trainable-encoder run on final best time within the same wall-clock. For long runs aiming at best A01 time, keep encoder trainable (bc_ah); use enc_freeze if you need fastest early first finish or fewer trainable parameters.
Encoder + A_head freeze (enc_ah_freeze): Freezing both encoder and A_head severely hurts RL: best time 24.97s (~470 ms worse than bc_ah), finish rate 22%, loss ~5x higher. The frozen A_head blocks policy adaptation. Do not freeze A_head; freeze encoder only if you prioritize early first finish.
Resume from enc_ah_freeze (enc_ah_freeze_resume): Unfreezing everything and using lower lr (5e-5 at start, 1e-4 from 500k) and lower epsilon (0.1 vs 1.0) recovers: best time 24.51s (vs 24.97s), finish rate 56% (vs 22%), loss ~73 (vs ~230). The pretrain prior is useful when allowed to fine-tune gently.
Full IQN from BC (full_iqn_bc): Loading the entire IQN from v3_multi_offset (pretrain_bc_heads_path) gives 24.50s at 190 min — tied with vis_pretrained, slightly worse than bc_ah (24.47s). No overfitting. Main benefit: faster wall-clock — reaches 24.50s in 192 min vs 244–275 min for other pretrained runs. High initial loss (BC classification vs RL) recovers by ~15 min.
Run full_iqn_bc_2 (~1018 min TB wall, separate run): Best 24.51s early; at ~770 min wall a collapse (loss/Q spike, eval rate 68% → 29%), then partial recovery. Best unchanged. Coincides with second TB log dir; investigate instability cause.
Run full_iqn_bc_3 (~1166 min, separate run): Best improves to 24.50s by ~410 min, then only 24.49s by 650 min and 24.48s by 910 min (~20 ms over ~500 min). Why almost no improvement: plateau at 24.50s + low LR (5e-5, 1e-5) in long tail; no collapse. Recommendation: Early stop after ~400–500 min; treat 24.48s as ceiling for this schedule.
Run best_ref (~265 min): Full IQN from BC with reference line from best human-driven run (A01_0.5m_cl_best.npy in map_cycle). Best 24.57s at 260 min; eval finish rate 36%; first eval finish 62.6 min. At similar wall time, similar best time to full_iqn_bc_3 (24.58s) but lower finish rate and later first finish. Reference line does not clearly improve final performance in this run.
Run 4explo (~337 min): Full IQN from BC with exploration repeat 4 (map_cycle exploration entry repeat: 4 instead of 64). Best 24.53s at 330 min; eval finish rate 63%; first eval finish 18.1 min. Slightly worse best time than full_iqn_bc_3 (24.50s at similar time); no clear benefit from fewer exploration episodes.
Recommendations
For best final A01 performance (current experiments): Use visual pretrain only (pretrain_encoder_path: "output/ptretrain/vis/v1/encoder.pt") or BC + A_head (bc_ah) — both reach 24.47s. Do not add the encoder-only BC stage (v1.1) if the goal is best final time; bc_ah matches vis and is better than encoder-only BC.
If you need fastest early convergence (e.g. for debugging): BC pretrain (v1.1) gives the earliest first finish (5.1 min) and a good initial time in the first 10–20 min; then consider switching to vis-only or bc_ah for long runs.
BC + A_head (bc_ah): Use pretrain_encoder_path and pretrain_actions_head_path from v2_multi_offset_ahead_dropout_inner; reaches 24.47s (tied with vis-only). To freeze the encoder: nn.vis.freeze: true (enc_freeze); do not freeze A_head (nn.decoder.advantage.freeze) — enc_ah_freeze performs poorly. If you have an enc_ah_freeze checkpoint, resume with unfreeze + lower lr/epsilon (enc_ah_freeze_resume) to recover to ~24.51s.
Suggested RL variations to better understand pretrain contribution:
Lower LR for pretrained parts: Use a separate param_group with 0.1x or 0.01x LR for pretrained layers (encoder, A_head) vs random-initialized ones (float_feature_extractor, iqn_fc, V_head). Tests whether “gentle” fine-tuning of pretrain preserves or improves results.
Warmup with frozen pretrain: Start RL with encoder (and optionally A_head) frozen for N steps (e.g. 50k–200k), then unfreeze. Compare with bc_ah and enc_freeze; tests if early stabilization helps.
Full IQN pretrain (full_iqn_bc): Use pretrain_bc_heads_path: "output/ptretrain/bc/v3_multi_offset" for fast wall-clock convergence (24.50s in ~192 min); final time tied with vis_pretrained, slightly worse than bc_ah (24.47s). No overfitting.
Pretrain ablation: Run RL from random init (baseline), encoder only (vis v2), A_head only (random encoder + BC A_head — would require loading only A_head), and both (bc_ah). Quantifies contribution of encoder vs A_head vs synergy.
- Shorter runs with matched steps: Run all variants for exactly the same number of gradient steps (e.g. 5M) to compare sample efficiency without wall-clock bias.
- Different exploration schedules: With pretraining, the policy is already reasonable; try a faster epsilon decay (e.g. 0.1 by 100k steps instead of 3M) to reduce random actions and see whether it helps or hurts.
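The first variation (separate LRs for pretrained vs fresh parts) maps directly onto PyTorch optimizer `param_groups`. A sketch under the assumption that the submodules are accessible as separate `nn.Module`s; the modules here are stand-ins for the real encoder/A_head/V_head:

```python
from torch import nn, optim

# Illustrative stand-ins for the IQN submodules named above.
encoder = nn.Linear(8, 8)   # pretrained visual backbone
a_head = nn.Linear(8, 4)    # pretrained advantage head (A_head)
v_head = nn.Linear(8, 1)    # randomly initialized value head (V_head)

base_lr = 5e-5
optimizer = optim.Adam([
    # 0.1x LR for the pretrained layers: "gentle" fine-tuning.
    {"params": encoder.parameters(), "lr": base_lr * 0.1},
    {"params": a_head.parameters(), "lr": base_lr * 0.1},
    # Full LR for the randomly initialized parts.
    {"params": v_head.parameters(), "lr": base_lr},
])

lrs = [g["lr"] for g in optimizer.param_groups]
```

Any LR scheduler then scales each group from its own base value, so the pretrained layers stay an order of magnitude slower throughout training.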
Next experiments to improve final time
Current best A01 time is 24.47s (bc_ah, vis_pretrained). full_iqn_bc and long runs plateau at 24.48–24.50s; the main bottleneck in long runs is the LR schedule (5e-5 → 1e-5 leaves almost no room to improve). Below are prioritized experiments to try to go below 24.47s.
1. Gentler LR decay (full_iqn_bc or bc_ah) — highest priority
Idea: Keep a higher LR in the plateau phase so the policy can still update. Current schedule drops to 1e-5 at 30M steps; after that, run _3 gained only ~20 ms in 15M steps.
Change: e.g. keep 5e-5 until 50M steps, then 2e-5 (instead of 1e-5 at 30M). Or: 1e-3 (0–10M), 5e-5 (10M–40M), 2e-5 (40M+).
Run: Same setup as full_iqn_bc (or bc_ah); only `lr_schedule` in the config differs. Compare by relative time and by steps against the current best runs.
Success: Best time below 24.47s, or a clear gain (e.g. 24.45s) at similar wall-clock.
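The two candidate schedules can be written as plain step-to-LR functions (a sketch assuming a step-indexed schedule; the project's actual `lr_schedule` config format may differ):

```python
def lr_variant_a(step: int) -> float:
    """Gentler tail: keep 5e-5 until 50M steps, then 2e-5 (instead of 1e-5 at 30M)."""
    return 5e-5 if step < 50_000_000 else 2e-5

def lr_variant_b(step: int) -> float:
    """Three-phase variant: 1e-3 for 0-10M, 5e-5 for 10M-40M, 2e-5 from 40M on."""
    if step < 10_000_000:
        return 1e-3
    if step < 40_000_000:
        return 5e-5
    return 2e-5
```

Either variant keeps the tail LR at 2e-5 rather than 1e-5, which is the point of the experiment: leave enough gradient signal to move past the 24.47s plateau.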
2. LR “plateau breaker” (full_iqn_bc)
Idea: When best time has not improved for N minutes (e.g. 60–90), bump LR slightly (e.g. from 5e-5 to 1e-4) for a short phase, then decay again. Requires code change (callback or schedule that depends on eval metric).
Rationale: Run _3 plateaued at 24.50s from ~410 min; a short higher-LR phase might escape the local optimum.
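One way to sketch such a metric-driven schedule; the parameter values (patience, bump LR, bump duration) are illustrative choices within the ranges suggested above, and the real implementation would hook into the eval loop:

```python
class PlateauBreaker:
    """Bump the LR for a short phase when the best eval time stalls."""

    def __init__(self, base_lr=5e-5, bump_lr=1e-4,
                 patience_min=75.0, bump_duration_min=30.0):
        self.base_lr = base_lr
        self.bump_lr = bump_lr
        self.patience_min = patience_min        # e.g. within the 60-90 min range
        self.bump_duration_min = bump_duration_min
        self.best_time = float("inf")
        self.last_improve_min = 0.0
        self.bump_until_min = -1.0              # no bump phase active initially

    def update(self, now_min: float, eval_best_time: float) -> float:
        """Call after each eval; returns the LR to use next."""
        if eval_best_time < self.best_time:
            self.best_time = eval_best_time
            self.last_improve_min = now_min
        elif (now_min - self.last_improve_min >= self.patience_min
              and now_min > self.bump_until_min):
            # Plateau detected: start a short higher-LR phase, then decay again.
            self.bump_until_min = now_min + self.bump_duration_min
            self.last_improve_min = now_min     # reset patience for the next bump
        return self.bump_lr if now_min <= self.bump_until_min else self.base_lr
```

On run _3's trace this would trigger roughly 75 min into the 24.50s plateau that started around 410 min, then fall back to the base LR after the bump window.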
3. bc_ah + longer training with gentler LR
Idea: bc_ah already reaches 24.47s (best so far) but runs were ~244 min. Run bc_ah with the gentler LR from (1) and train longer (e.g. 500–600 min) with early stopping.
Goal: See if 24.47s is a ceiling for bc_ah or if more steps + higher LR in the tail push below (e.g. 24.45s).
4. Full IQN from BC (v3_multi_offset) + gentler LR
Idea: full_iqn_bc converges fast but plateaus at 24.48–24.50s. Use the same pretrain, but with the gentler `lr_schedule` from (1), and run at least 400–500 min.
Goal: Check whether full_iqn_bc can match or beat 24.47s when the LR does not kill updates in the long tail.
5. Lower LR for pretrained parts (bc_ah or full_iqn_bc)
Idea: Use a smaller LR (e.g. 0.1x) for pretrained layers (encoder, A_head or full IQN body) and normal LR for the rest. Reduces “forgetting” and may allow finer tuning at the end.
Requires: Code change (param_groups in optimizer). Already listed in “Suggested RL variations” above.
6. Exploration schedule
Idea: Try a slower epsilon decay (e.g. 0.1 by 5M steps instead of 3M) so the policy explores more in the plateau phase; or try a slightly higher `epsilon_boltzmann_schedule` so evaluation is less greedy.
Risk: Can increase variance or slow convergence; compare by relative time and by steps.
7. Resume from best checkpoint + fine-tune
Idea: Take the best checkpoint from full_iqn_bc_3 (e.g. when best was 24.50s around 400 min) or bc_ah (24.47s), resume training with a fixed small LR (e.g. 2e-5 or 5e-5) for 200–300 min.
Goal: “Fine-tuning” phase without the early high-LR phase; may refine the policy without destabilizing.
Suggested order: Do (1) gentler LR first (config-only, no code change). If (1) gives a clear gain, then try (3) bc_ah + gentler LR and (4) full_iqn_bc + gentler LR for a direct comparison. Then (5) and (2) if you can change the code; (6) and (7) as follow-ups.
Analysis Tools:
- By relative time and by steps (three runs): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long A01_as20_long_vis_pretrained A01_as20_long_vis_bc_pretrained --interval 5 --step_interval 50000`
- By relative time and by steps (four runs, including bc_ah): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long A01_as20_long_vis_pretrained A01_as20_long_vis_bc_pretrained A01_as20_long_vis_bc_ah_pretrained --interval 5 --step_interval 50000`
- bc_ah vs enc_freeze (two runs): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_vis_bc_ah_pretrained A01_as20_long_vis_bc_ah_pretrained_enc_freeze --interval 5 --step_interval 50000`
- bc_ah vs enc_freeze vs enc_ah_freeze (three runs): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_vis_bc_ah_pretrained A01_as20_long_vis_bc_ah_pretrained_enc_freeze A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze --interval 5 --step_interval 50000`
- enc_ah_freeze vs enc_ah_freeze_resume: `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze A01_as20_long_vis_bc_ah_pretrained_enc_ah_freeze_resume --interval 5 --step_interval 50000`
- Full IQN from BC (five runs): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long A01_as20_long_vis_pretrained A01_as20_long_vis_bc_pretrained A01_as20_long_vis_bc_ah_pretrained A01_as20_long_full_iqn_bc --interval 5 --step_interval 50000`
- Run full_iqn_bc_2: `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_2 --logdir tensorboard --interval 10 --step_interval 2000000`
- Run full_iqn_bc_3: `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_3 --logdir tensorboard --interval 10 --step_interval 5000000`
- Run best_ref (human reference line): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_3_best_ref --logdir tensorboard --interval 10 --step_interval 2000000`
- Run 4explo (exploration repeat 4): `python scripts/analyze_experiment_by_relative_time.py A01_as20_long_full_iqn_bc_3_4explo --logdir tensorboard --interval 10 --step_interval 2000000`
- Plots: `python scripts/generate_experiment_plots.py --experiments pretrain_bc pretrain_bc_full_iqn pretrain_bc_enc_freeze pretrain_bc_enc_ah_freeze pretrain_bc_enc_ah_freeze_resume` (generates comparison plots when tensorboard logs exist)