Configuration Guide

This guide documents the main configuration knobs in TrackMania RL (Rulka).

Configuration files are located in config_files/ and organized by category for easy editing. Field names and defaults are split between config_files/config_schema.py (environment, training, memory, …) and config_files/nn_schema.py (YAML key nn). This guide explains behavior and typical usage; for an exhaustive nn tree see Neural network YAML (nn) — full reference.

Quick Start

Configuration is loaded from a single YAML file at startup and accessed via get_config():

from config_files.config_loader import get_config

# Access any setting (flat attribute access)
cfg = get_config()
batch_size = cfg.batch_size
lr_schedule = cfg.lr_schedule

To run training with a specific config:

python scripts/train.py --config config_files/rl/config_default.yaml

You can version configs with separate YAML files (e.g. config_uni18.yaml) and pass the path with --config. User-specific settings (paths, usernames) are read from a .env file in the project root. Config is loaded once per process and cached; there is no hot-reload.

Configuration Structure (YAML)

The default YAML (config_files/rl/config_default.yaml) is organized into sections that correspond to the former Python modules:

  1. environment — Environment and simulation

  2. nn — Network architecture (see Neural network YAML (nn) — full reference): fusion_mode, init_from_pretrained, vis, float, encoder, iqn, decoder, training, plus optional RL parameter freeze flags under those subtrees

  3. training — Training hyperparameters (algorithm, LR, schedules, …)

  4. memory — Replay buffer (IQN)

  5. exploration — Exploration strategies (IQN)

  6. rewards — Reward shaping

  7. map_cycle — Map training cycle

  8. performance — System performance

  9. btr — Optional IQN/BTR flags (Munchausen, LayerNorm, NoisyNet, …). CNN fields in btr: can be merged into nn.vis.cnn at load when those keys are omitted there (config_loader._merge_btr_cnn_into_vis). Prefer setting vision CNN under nn.vis.cnn in new configs.

Neural network YAML (nn) — full reference

The hierarchical block under YAML key nn is validated as config_files.nn_schema.NnConfig. It is merged from disk in config_loader.load_config (including the BTR→CNN fill) into the flat NeuralNetworkConfig on RulkaConfig.neural_network (there is no separate top-level YAML key neural_network:; pretrain and other scripts that refer to the image size should use nn.vis.image_size / get_config().w_downsized). For a full matrix of supported topologies (algorithm × fusion_mode × vision × fusion trunk), see NN topology catalog (supported stacks).

Flat access: get_config() (ConfigView) exposes most fields as attributes: e.g. cfg.vis, cfg.decoder, cfg.fusion_mode, cfg.float_hidden_dim, cfg.dense_hidden_dimension, cfg.transformers (bundle from NnConfig.to_multimodal(): fusion_mode + encoder.transformer + init_from_pretrained), cfg.use_jit, cfg.iqn_n, …

nn.fusion_mode (string)

none | vision_transformer | post_concat | unified. PPO: selects TorchMultimodalActorCritic vs CNN vs HF-only actor. IQN: when not none, builds the same multimodal body as PPO (TorchMultimodalActorCritic with include_policy_heads=False) inside IQNSharedBackboneNetwork (IQN architecture). When none, IQN uses classic IQN_Network or HF-vision stack (see that page). The schema enforces consistency with nn.vis (e.g. unified vs d_model). YAML may place fusion_mode under nn.encoder; the loader hoists it to nn.

nn.init_from_pretrained (string)

Optional directory with Rulka save_pretrained fusion weights when fusion_mode != none. PPO loads it inside make_multimodal_fusion_network_pair after build (unless skipped via trackmania_rl.utilities.skip_multimodal_fusion_hub_init_from_pretrained). IQN does not run that hub load automatically today — start fusion IQN from an RL checkpoint or wire loading in your script. Hub JSON must include rulka_transformers.vis_branch (current saves add it via infer_vis_branch). Trust flag: nn.encoder.transformer.trust_remote_code.

nn.vis — vision branch

no_image (bool): if true, float-only (no CNN/ViT). image_size: width / height (aliases w / h). freeze (bool, default false): see RL parameter freeze. Exactly one of:

  • ``cnn`` — VisCnnBodyConfig: use_impala_cnn, impala_model_size, use_adaptive_maxpool, adaptive_maxpool_size, use_spectral_norm.

  • ``transformer`` — TransformersConfig (ViT slot / HF / fusion vision): use_hf_backbone, model_name_or_path, trust_remote_code, hidden_dropout_prob, d_model, n_layers, n_heads, ff_mult, dropout, patch_size, post_concat_seq_len, unified_float_tokens, fusion_tokens (summary | patch_tokens — vision tokens into multimodal fusion, e.g. post_concat + token_sequence). Constraint: d_model divisible by n_heads.

nn.float (YAML key float)

freeze (bool): see RL parameter freeze. mlp.hidden_dim — width of the float MLP trunk (default branch).

nn.encoder

Used when fusion_mode != none (multimodal bundle also exposed as cfg.transformers from NnConfig.to_multimodal()).

  • freeze (bool): see RL parameter freeze.

  • mlp (optional): hidden_dim — overrides float width for fusion builds (float_hidden_dim_effective()), IQN and PPO.

  • transformer: fusion-stack TransformersConfig (d_model, n_layers, n_heads, ff_mult, dropout, post_concat_seq_len, unified_float_tokens; HF fusion: use_hf_backbone, model_name_or_path, trust_remote_code, hidden_dropout_prob).

  • fusion_encoder (optional): linear | native_transformer | mlp | cnn | hf_embedding — trunk after early fusion; if omitted, inferred (e.g. linear for vision_transformer, native_transformer for post_concat / unified, hf_embedding when encoder.transformer.use_hf_backbone).

  • fusion_mlp / fusion_cnn / hf_embedding: sub-configs when fusion_encoder is mlp / cnn / hf_embedding (see config_files/nn_schema.py).

  • post_concat_layout: fused_vector | token_sequence (post_concat only).

  • float_token_input: raw | mlp_hidden; float_token_layout: dense | per_feature (for token_sequence layouts).

The same keys appear on the flattened multimodal dict (cfg.transformers) for factories; YAML loads them under nn.encoder and the root nn for fusion_mode / init_from_pretrained.

nn.iqn

embedding_dimension, n, k, kappa — IQN quantile hyperparameters. freeze (bool): see RL parameter freeze (IQN only).

nn.decoder (IQN heads / width; PPO trunk + heads)

shared_input: pre_tau | post_tau (transformer slots require post_tau in the schema). dense_hidden_dimension. shared_trunk_freeze (bool): see RL parameter freeze (PPO only). Per-slot advantage and value heads: optional freeze (bool); see RL parameter freeze. The slot body is either

  • ``mlp`` — MLPConfig: hidden_dim (alias hidden), n_hidden_layers (alias layers); if hidden_dim omitted, heads use dense_hidden_dimension // 2.

  • ``transformer`` — TransformerStackConfig: d_model, n_layers, n_heads, ff_mult, dropout; use_hf_backbone: true is rejected until implemented. Legacy key transformer_encoder is accepted as an alias for transformer.

nn.training (YAML key training inside nn)

IQN-adjacent: use_jit, use_ddqn, clip_grad_value, clip_grad_norm, number_memories_trained_on_between_target_network_updates, soft_update_tau, target_self_loss_clamp_ratio, reset knobs (single_reset_flag, reset_every_n_frames_generated, additional_transition_after_reset, last_layer_reset_factor, overall_reset_mul_factor). Not the same dict as top-level training: (run name, algorithm, LR).

RL parameter freeze

Optional booleans (default false) under YAML nn mark matching weights as non-trainable during IQN or PPO learning: requires_grad=False and those tensors are not passed to the optimizer. Prefix matching follows module.named_parameters() names; keys prefixed with _orig_mod. (from torch.compile) are normalized the same way as in training (see trackmania_rl/param_freeze.py). IQN periodic soft reset uses the same prefix set so frozen weights are not overwritten.
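
A minimal sketch of this prefix-based freezing, assuming the prefix set has already been derived from the nn.* flags (the real mapping and name normalization live in trackmania_rl/param_freeze.py; freeze_by_prefix and the usage lines below are illustrative, not the project's API):

import torch

def freeze_by_prefix(model: torch.nn.Module, prefixes: set) -> list:
    """Mark parameters matching any prefix as non-trainable; return the remaining trainable ones."""
    trainable = []
    for name, param in model.named_parameters():
        plain_name = name.removeprefix("_orig_mod.")  # normalize names wrapped by torch.compile
        if any(plain_name.startswith(prefix) for prefix in prefixes):
            param.requires_grad_(False)               # frozen: no gradients, not given to the optimizer
        else:
            trainable.append(param)
    return trainable

# Hypothetical usage for nn.vis.freeze on a classic IQN network:
# trainable_params = freeze_by_prefix(online_network, {"img_head."})
# optimizer = torch.optim.RAdam(trainable_params, lr=1e-4)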

Flags and effect

  • nn.vis.freeze — Vision stem. Classic IQN (IQN_Network): img_head. only. IQN multimodal or HF-vision shared stack: same vision prefixes as PPO but under submodule fusion. (e.g. fusion.img_head., fusion._hf_vis_backbone.). PPO: those prefixes without the fusion. prefix (trackmania_rl/param_freeze.py). Only names that exist in your run are frozen; the learner log lists active prefixes.

  • nn.float.freeze — Classic IQN: float_feature_extractor.. IQN multimodal / HF vision: fusion.float_feature_extractor. (and fusion.float_to_hidden. on HF-vision-only IQN). PPO: float_feature_extractor. and float_to_hidden. where applicable.

  • nn.encoder.freeze — Fusion trunk (after vision+float tokenization, before policy trunk / IQN quantile block): bridge, enc_fusion_native, HF fusion projections, sequence tokenizers, position parameters, … PPO: top-level module names. IQN with fusion_mode != none: same logical tensors, prefixed with fusion.. IQN with HF vision only and fusion_mode: none: the flag currently does not map to fusion-trunk prefixes (no separate encoder submodule in that wrapper); use vis / float / head freezes as needed.

  • nn.iqn.freeze — IQN only: iqn_fc. (quantile cosine → hidden MLP). Ignored for PPO.

  • nn.decoder.advantage.freeze / nn.decoder.value.freeze — IQN: A_head. and A_head_multi. (multi-action) / V_head.. PPO: policy_head. / value_head..

  • nn.decoder.shared_trunk_freeze — PPO only: trunk. (shared MLP before policy/value heads). Ignored for IQN.

Removed (do not use in new configs): top-level training: keys pretrain_encoder_freeze, pretrain_float_head_freeze, pretrain_iqn_fc_freeze, pretrain_actions_head_freeze, pretrain_V_head_freeze, and nn_frozen_param_prefixes — freeze is configured only under nn as above.

BC-only models that wrap the policy (e.g. multi-offset BC with extra bc_heads) are outside this RL freeze map; pretrain code may freeze heads separately.

Example (IQN — freeze visual backbone only):

nn:
  vis:
    freeze: true
    image_size: { width: 64, height: 64 }
    cnn: { use_impala_cnn: true, impala_model_size: 2, ... }

Example (PPO fusion — freeze ViT stem, train fusion + trunk):

nn:
  vis:
    freeze: true
  encoder:
    freeze: false
  decoder:
    shared_trunk_freeze: false
    advantage:
      freeze: false
      mlp: { layers: 1 }
    value:
      freeze: false
      mlp: { layers: 1 }

float_input_dim (int)

Computed at load from environment; do not set in YAML unless you know the implications.

Reference YAML files (config_files/rl/):

  • IQN: config_default.yaml, config_btr.yaml — classic IQN_Network (fusion_mode: none). For multimodal IQN, use the same nn.fusion_mode / nn.encoder / nn.vis layout as PPO (e.g. config_ppo_post_concat_cnn_tf.yaml, config_ppo_transformer.yaml) with training.algorithm: iqn. config_btr_post_concat_cnn_transformer.yaml — full BTR recipe (btr: + Munchausen / NoisyNet / etc.) with post_concat + CNN vision + fusion TransformerEncoder (same IQN stack as other fusion configs).

  • PPO: config_ppo.yaml — full example (CNN default; comments for HF / fusion). config_ppo_cnn_mlp.yaml — minimal CNN + float MLP, fusion_mode: none. config_ppo_transformer.yaml — post_concat multimodal with HF timm vision + HF fusion encoder (token_sequence / per_feature floats; not vision_transformer mode). config_ppo_post_concat_cnn_tf.yaml — post_concat with nn.vis.cnn + native torch.nn.TransformerEncoder fusion. Native fusion_mode: vision_transformer (patch + fuse) has no separate reference YAML; derive from config_ppo.yaml by setting nn.fusion_mode and nn.vis.transformer with use_hf_backbone: false.

Note

Implementation helpers (avoid duplicating dicts in code): vision CNN kwargs for _build_img_head are centralized in trackmania_rl/nn_build/vis_cnn_head.py (nn.vis.cnn after BTR merge). IQN head flags use_layer_norm / use_noisy_linear / noisy_sigma0 from the flat loaded config are read via trackmania_rl/nn_build/iqn_btr_from_config.py for classic IQN_Network and IQNSharedBackboneNetwork. Level 0 visual pretrain builds its 1-channel encoder from PretrainConfig.rl_config_path using the same vision kwargs when cnn_head_kw is set (see pretrain/models.py / pretrain/train.py).

BTR block (btr:)

Optional IQN enhancements (config_files.config_schema.BTRConfig). Same training.algorithm: iqn; flags mainly affect classic IQN_Network (CNN / float / head wiring). Multimodal IQN still uses the same iqn_fc + dueling heads and shares LayerNorm / NoisyNet style with the decoder config where wired. See BTR options (IQN + paper extras).

Fields (all booleans unless noted):

  • use_munchausen, munchausen_alpha, munchausen_entropy_tau, munchausen_lo — Munchausen-style targets.

  • use_impala_cnn, impala_model_size, use_adaptive_maxpool, adaptive_maxpool_size, use_spectral_norm — also mirrored for merge into nn.vis.cnn when those CNN keys are missing there; canonical place for vision CNN is nn.vis.cnn (see config_files/rl/config_btr.yaml).

  • use_layer_norm — LayerNorm in float extractor / IQN heads where wired.

  • use_noisy_linear, noisy_sigma0 — factorized noisy layers in heads; rollouts use noise reset/disable.

Flat access: get_config().use_munchausen, etc., resolve from btr. For code that builds IQN MLP heads, the same three dense-head toggles are grouped as iqn_btr_mlp_head_kw_from_config(get_config()) in trackmania_rl/nn_build/iqn_btr_from_config.py (classic IQN, multimodal IQN wrapper, BC full-IQN).

Environment Configuration

Located in the environment section of the config YAML.

Timing Configuration

tm_engine_step_per_action: int = 5

Number of game simulation steps per agent action

Controls temporal resolution of agent control. The game engine runs at 100Hz (10ms per step).

  • Lower values (1-3): Finer control, more actions per second, slower data collection

  • Higher values (5-10): Coarser control, faster data collection, easier learning

  • Current: 5 steps = 50ms per action = 20 actions/second

Tip

Start with 5 for most maps. Reduce to 3-4 for technical sections requiring precise control.

ms_per_tm_engine_step: int = 10

Milliseconds per simulation step (fixed by game engine)

TrackMania engine runs at 100Hz. Do not modify unless you understand game internals.

ms_per_action: int = 50

Total milliseconds per agent action (computed)

Calculated as: tm_engine_step_per_action × ms_per_tm_engine_step

With defaults: 5 × 10ms = 50ms (20 actions/second).

When multi-action is enabled (rl_action_offsets_ms has more than one offset), the rollout uses 10ms per step and this value is 10; one “decision” spans ms_per_block (see below).

Multi-action (N future actions)

rl_action_offsets_ms: list[int] = [0]

Temporal offsets in milliseconds for predicting N actions in one forward pass (BC-style).

  • Empty or ``[0]``: Single-action mode (one action per decision, current behavior). tm_engine_step_per_action controls step period.

  • e.g. ``[0, 10, 20, 30, 40]``: Multi-action mode. The policy outputs N=5 actions applied at 0, 10, 20, 30, 40 ms from the current state. Rollout runs at 10ms step period; one forward pass per block of N steps.

All values must be multiples of 10. Typically use consecutive 10ms steps (0, 10, 20, …).

n_actions_per_block: int = 1

Number of actions per block (computed from rl_action_offsets_ms).

Equals len(rl_action_offsets_ms) when multi-action is enabled, else 1. One replay transition = one block (state, N actions, sum of rewards over N steps, next state).

ms_per_block: int = 50

Milliseconds per decision block (computed).

When multi-action: n_actions_per_block × 10. When single-action: same as ms_per_action. Used for reward aggregation and mini-race duration in steps.
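
The derived timing fields can be reproduced with a few lines of Python; the helper below is illustrative (not the loader code) and assumes the defaults described above:

def derived_timing(tm_engine_step_per_action=5, ms_per_tm_engine_step=10, rl_action_offsets_ms=(0,)):
    """Illustrative computation of ms_per_action, n_actions_per_block and ms_per_block."""
    multi_action = len(rl_action_offsets_ms) > 1
    if multi_action:
        ms_per_action = 10                              # multi-action rollouts step every 10 ms
        n_actions_per_block = len(rl_action_offsets_ms)
    else:
        ms_per_action = tm_engine_step_per_action * ms_per_tm_engine_step
        n_actions_per_block = 1
    ms_per_block = n_actions_per_block * 10 if multi_action else ms_per_action
    return ms_per_action, n_actions_per_block, ms_per_block

print(derived_timing())                                          # (50, 1, 50)
print(derived_timing(rl_action_offsets_ms=(0, 10, 20, 30, 40)))  # (10, 5, 50)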

Spatial Configuration

n_zone_centers_in_inputs: int = 40

Number of waypoints (zone centers) used as input to the agent

The agent observes upcoming waypoints along the track centerline for spatial awareness.

  • Purpose: Provides lookahead vision of track geometry

  • Lower values (20-30): Less memory, less forward vision

  • Higher values (40-60): More forward vision, better long-term planning

  • Memory impact: Each waypoint adds 3 floats (X,Y,Z) to input

  • Current: 40 waypoints with 20-waypoint spacing covers ~400m ahead

one_every_n_zone_centers_in_inputs: int = 20

Sampling rate for waypoints

Only every N-th waypoint is fed to the network to reduce input dimensionality.

  • Lower values (10-15): Denser track representation, more inputs

  • Higher values (20-30): Sparser representation, faster inference

  • Current: Sample every 20th waypoint

With distance_between_checkpoints=0.5m, this gives waypoints every 10 meters.

distance_between_checkpoints: float = 0.5

Spacing between consecutive virtual checkpoints (meters)

Track is discretized into virtual checkpoints for progress tracking.

  • Lower values (0.3-0.5m): Finer progress tracking, more checkpoints

  • Higher values (0.5-1.0m): Coarser tracking, fewer checkpoints

  • Current: 0.5m provides good balance

Warning

Very low values (<0.3m) can cause performance issues with many checkpoints.
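
As a quick sanity check of the spatial settings, the forward coverage follows directly from the three values above (back-of-envelope Python with the defaults):

n_zone_centers_in_inputs = 40
one_every_n_zone_centers_in_inputs = 20
distance_between_checkpoints = 0.5  # meters

waypoint_spacing_m = one_every_n_zone_centers_in_inputs * distance_between_checkpoints  # 10 m between sampled waypoints
lookahead_m = n_zone_centers_in_inputs * waypoint_spacing_m                              # ~400 m of track visible ahead
checkpoints_per_km = 1000 / distance_between_checkpoints                                 # 2000 virtual checkpoints per km
print(waypoint_spacing_m, lookahead_m, checkpoints_per_km)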

n_zone_centers_extrapolate_before_start_of_map: int = 20

Virtual waypoints before the start line

Number of extrapolated zone centers added before the actual track start. These virtual waypoints extend the track centerline backwards from the start line.

Why needed:

  • At race start, the car may be positioned before the start line (during countdown/initialization)

  • The zone tracking system needs valid zone indices even when the car is before the start

  • current_zone_idx is initialized to this value at rollout start

How it works:

  • Virtual waypoints are created by extrapolating backwards along the direction from the first real checkpoint to the second

  • This creates a smooth continuation of the track before the start

  • The system can track position and progress even before crossing the start line

Typical values: 10-30. Current: 20 provides sufficient buffer for initialization.

n_zone_centers_extrapolate_after_end_of_map: int = 1000

Virtual waypoints after the finish line

Number of extrapolated zone centers added after the actual track finish. These virtual waypoints extend the track centerline forwards from the finish line.

Why needed:

  • After crossing the finish, the car may continue moving forward

  • The system needs to track position and calculate distances even after finish

  • Used for calculating distance-to-finish notifications (see margin_to_announce_finish_meters)

  • Prevents zone tracking from breaking when the car overshoots the finish

How it works:

  • Virtual waypoints are created by extrapolating forwards along the direction from the last real checkpoint to the second-to-last

  • The agent cannot enter the final virtual zone (protected by a check in zone tracking)

  • Used to compute remaining distance to finish for reward shaping and state representation

Why so many (1000)?

  • After finishing, the car may coast for significant distance

  • Need enough virtual waypoints to cover potential overshoot

  • With distance_between_checkpoints=0.5m, 1000 waypoints = ~500 meters of virtual track

  • Ensures distance calculations remain valid even if the car travels far past finish

Typical values: 500-2000. Current: 1000 provides generous buffer for post-finish tracking.

road_width: int = 90

Maximum allowable lateral distance from centerline (meters)

Used to determine if car is on-track or off-track. Includes safety margin.

  • Purpose: Collision detection and checkpoint validation

  • Current: 90m is conservative (actual roads are 16-32m wide)

  • Typical range: 50-100m

Temporal Configuration - Mini-Races

What are mini-races?

Mini-races are a key technique that allows the agent to learn with gamma = 1 (no discounting) by reinterpreting each state as part of a fixed-duration “mini-race” rather than the full track trajectory. This simplifies learning and enables efficient credit assignment.

How it works:

  1. During training, when sampling a batch from the replay buffer, each transition is reinterpreted as part of a random 7-second “mini-race”

  2. A random “current time” (0 to 7 seconds) is sampled independently for each transition in the batch

  3. The state is interpreted as “we are at time X in a mini-race”

  4. If the next state would exceed 7 seconds, the transition becomes terminal

  5. Q-values represent “expected sum of rewards in the next 7 seconds” instead of “expected sum of discounted rewards until finish”

Important: How intervals are selected

The 7-second intervals are not sequential (0-7, 7-14, 14-21…). Instead:

  • Each transition from the real race can be reinterpreted as part of any random 7-second window

  • For example, a transition at 15 seconds into the race might be interpreted as:

    - "Time 0 in a mini-race" (covering 15-22 seconds of the real race)

    - "Time 3.5 in a mini-race" (covering 11.5-18.5 seconds)

    - "Time 6.5 in a mini-race" (covering 8.5-15.5 seconds)

    - Any other random position

  • Intervals overlap extensively and are sampled randomly for each batch

  • This means the same real transition can be trained on as part of many different mini-race contexts

How does the agent learn the full track?

Even though Q-values only predict 7 seconds ahead, the agent still learns to drive the entire track efficiently:

  1. Local optimization → global optimization: By optimizing every 7-second segment along the track, the agent implicitly optimizes the full trajectory

  2. Overlapping coverage: Since intervals overlap and are randomly sampled, transitions from all parts of the track are trained with various mini-race contexts

  3. Greedy policy: At inference time, the agent greedily selects actions that maximize the 7-second Q-value, which naturally chains into a good full-track policy

  4. Reward structure: The rewards (progress, time penalty) encourage forward progress, so optimizing 7-second segments leads to efficient full-track driving

Benefits:

  • Simplified learning: No need to learn long-term value estimates for the entire track

  • Gamma = 1: Can use undiscounted returns because the horizon is naturally limited

  • Better credit assignment: Focuses learning on near-term consequences (7 seconds)

  • Stability: Avoids issues with very long episodes and sparse rewards

Implementation:

The mini-race logic is implemented in buffer_utilities.buffer_collate_function(), which is called during batch sampling. The first element of state_float contains the current time in the mini-race (in actions).
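
The exact logic lives in buffer_utilities.buffer_collate_function(); the toy sketch below only illustrates the core idea (a random per-transition mini-race time, terminal when the horizon is exceeded) and is not the real function:

import numpy as np

def reinterpret_as_mini_race(batch_size, duration_actions=140, actions_per_transition=1, rng=None):
    """Toy version of the mini-race reinterpretation applied at batch-sampling time."""
    rng = rng if rng is not None else np.random.default_rng()
    # Random "current time" (in actions) inside the mini-race, sampled independently per transition.
    current_time = rng.integers(0, duration_actions, size=batch_size)
    # If the next state would fall outside the horizon, the transition is treated as terminal.
    terminal = current_time + actions_per_transition >= duration_actions
    return current_time, terminal

times, terminal = reinterpret_as_mini_race(batch_size=4)
# `times` would overwrite the first element of state_float for each sampled transition.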

temporal_mini_race_duration_ms: int = 7000

Duration of mini-races (milliseconds)

The fixed horizon for each mini-race. All Q-values are defined as “expected sum of rewards in the next N milliseconds”.

  • Purpose: Defines the temporal horizon for value estimation

  • Current: 7000ms = 7 seconds

  • Typical range: 5000-10000ms (5-10 seconds)

Trade-offs:

  • Shorter (3-5s): Faster learning, but may miss long-term consequences

  • Longer (10-15s): Better long-term planning, but slower learning and more variance

  • Current (7s): Good balance for TrackMania’s typical decision-making horizon

With ms_per_action = 50ms, this equals 140 actions per mini-race.

temporal_mini_race_duration_actions: int = 140

Duration of mini-races in actions (computed automatically)

Automatically calculated as temporal_mini_race_duration_ms // ms_per_action.

Used internally for mini-race time calculations and terminal state detection.

margin_to_announce_finish_meters: int = 700

Distance threshold to notify agent of finish line (meters)

When the agent is farther from the finish than this distance, the distance-to-finish feature in the state is capped at this value; only within the last 700 meters does it track the actual remaining distance. This provides a consistent signal as the agent approaches the finish.

Why needed:

  • The agent needs to know it’s approaching the finish to adjust behavior (e.g., maintain speed, avoid unnecessary actions)

  • Without capping, the feature would range over the entire remaining track length, giving an inconsistent scale when the finish is still far away

  • Capping keeps the feature constant at 700 m until the agent enters the last 700 meters, after which it decreases to 0 at the finish, giving a stable “finish approaching” signal

How it’s used:

  • Included in state_float as one of the input features

  • Value is min(margin_to_announce_finish_meters, actual_distance_to_finish)

  • When far from finish (>700m), shows 700m (capped, constant signal)

  • When close (<700m), shows the actual distance to the finish

Typical values: 500-1000 meters. Current: 700m provides good advance warning for finish approach.

Contact and Physics

n_contact_material_physics_behavior_types: int = 4

Number of surface physics categories used as scalar input

TrackMania has many surface types (Concrete, Grass, Ice, Turbo, Dirt, etc.). They are grouped into a small number of physics behavior categories for the agent’s input. This parameter is the number of such categories that are explicitly encoded in the state.

How it works:

  • Surface types are grouped in trackmania_rl/contact_materials.py (e.g. Asphalt-like, Grass, Dirt, Turbo, and “other”)

  • For each of the 4 wheels, the game provides the current contact material ID

  • The state includes a one-hot-like encoding per wheel: for each category index 0 .. n_contact_material_physics_behavior_types - 1, one float indicates whether that wheel is on that surface category

  • Total floats from contact materials: 4 wheels × n_contact_material_physics_behavior_types = 4 × 4 = 16

Why needed:

  • Grip and behavior depend strongly on surface (asphalt vs grass vs turbo vs ice)

  • The agent needs to know which surface each wheel is on to predict handling and choose actions

  • Using a few physics groups keeps the input size small while preserving the main distinction (road vs off-road vs turbo vs other)

Current: 4 categories. The mapping from game materials to these groups is defined in contact_materials.py (e.g. 0 = Asphalt-like, 1 = Grass, 2 = Dirt, 3 = Turbo; other materials map to an implicit “other” and are not encoded as a separate category index).

Do not change unless you also change the grouping logic in contact_materials.py and the corresponding input dimension in the config.
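
A hedged sketch of the per-wheel encoding described above; the material-to-category mapping here is purely illustrative (the real grouping lives in trackmania_rl/contact_materials.py):

MATERIAL_TO_CATEGORY = {0: 0, 1: 1, 4: 2, 7: 3}   # hypothetical ids: asphalt-like, grass, dirt, turbo
N_CATEGORIES = 4                                   # n_contact_material_physics_behavior_types

def encode_wheel_contacts(wheel_material_ids):
    """One-hot-like encoding: 4 wheels x 4 categories = 16 floats; unmapped materials get no category."""
    floats = []
    for material_id in wheel_material_ids:
        category = MATERIAL_TO_CATEGORY.get(material_id)   # None for "other" materials
        floats.extend(1.0 if category == c else 0.0 for c in range(N_CATEGORIES))
    return floats

print(len(encode_wheel_contacts([0, 0, 1, 7])))  # 16 floats appended to state_float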

n_prev_actions_in_inputs: int = 5

Number of previous actions included in the state (action history)

The state includes the last N actions taken by the agent, each encoded as 4 binary flags: accelerate, brake, left, right (see config_files/inputs_list.py). This gives the network a short history of what the car was doing.

How it works:

  • At each step, the last n_prev_actions_in_inputs actions are taken from rollout_results["actions"]

  • Each action is expanded to 4 floats: one per input name in ["accelerate", "brake", "left", "right"] (1.0 if that input is pressed, 0.0 otherwise)

  • They are concatenated in order (oldest to newest)

  • Total floats from action history: 4 × n_prev_actions_in_inputs = 4 × 5 = 20

Why needed:

  • The MDP is not fully Markovian from a single frame: steering and acceleration have inertia, and the current command is partly a continuation of the previous ones

  • Including the last few actions makes the state closer to Markovian and helps the policy produce smooth, consistent control (e.g. sustained turns instead of jitter)

  • Without action history, the agent would have to infer “I was turning left” from the image/state alone, which is harder and noisier

Trade-offs:

  • Larger (6–8): Longer memory of past actions, smoother behavior, more input dimensions

  • Smaller (3–4): Fewer parameters, but less context and possibly jerkier control

Current: 5 actions. With 50 ms per action, this is 250 ms of action history (~0.25 s).

Changing this value changes float_input_dim (add or subtract 4 per action); it is computed from config at load time.

Multi-action (N>1): The state stores the last n_prev_actions_in_inputs individual actions (flattened from blocks). Total action floats = 4 × n_prev_actions_in_inputs, same as single-action. Increase the value to keep the same temporal coverage (e.g. 25 at 10ms = 250ms ≈ 5 at 50ms).
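
An illustrative sketch of the action-history floats (the canonical input names live in config_files/inputs_list.py; the example actions and helper name here are hypothetical):

INPUT_NAMES = ["accelerate", "brake", "left", "right"]

def encode_action_history(previous_actions, n_prev_actions_in_inputs=5):
    """Flatten the last N actions (oldest to newest) into 4 binary floats each."""
    recent = previous_actions[-n_prev_actions_in_inputs:]
    floats = []
    for action in recent:
        floats.extend(1.0 if action.get(name, False) else 0.0 for name in INPUT_NAMES)
    return floats

history = [{"accelerate": True, "left": True}] * 5
print(len(encode_action_history(history)))  # 5 actions x 4 flags = 20 floats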

Timeouts

cutoff_rollout_if_race_not_finished_within_duration_ms: int = 300000

Maximum race duration before forced termination (milliseconds)

Prevents infinite loops and stuck states. Race is cut off if not finished within this time.

  • Purpose: Prevent endless rollouts

  • Current: 300,000ms = 5 minutes

  • Typical range: 180,000-600,000ms (3-10 minutes)

cutoff_rollout_if_no_vcp_passed_within_duration_ms: int = 2000

Timeout if no checkpoint passed (milliseconds)

Detects when agent is stuck or driving backwards.

  • Purpose: Early termination of unproductive rollouts

  • Current: 2,000ms = 2 seconds

  • Typical range: 1,000-5,000ms

timeout_during_run_ms: int = 30000

TMInterface command timeout during active racing (milliseconds)

How long to wait for TMInterface responses while car is racing.

  • Current: 30,000ms = 30 seconds

  • Previous: 10,100ms (increased for stability)

Note

Higher values prevent timeout on lag spikes but slow error detection.

game_reboot_interval: int = 43200

Auto-restart interval for TrackMania (seconds)

Automatically reboots game to prevent memory leaks during long training sessions.

  • Current: 43,200s = 12 hours

  • Typical range: 21,600-86,400s (6-24 hours)

Game Settings

game_camera_number: int = 2

Camera view in TrackMania

  • 1: Behind car

  • 2: First person (recommended for RL)

  • 3: Top view

Current: 2 (first person) matches human driving perspective

sync_virtual_and_real_checkpoints: bool = True

Align virtual checkpoints with game checkpoints

Ensures custom virtual checkpoint progress matches official game checkpoint progress.

  • True: Virtual CPs synchronized with game CPs (recommended)

  • False: Independent virtual checkpoint system

Neural Network Configuration

Located in the nn section of the config YAML (schema: config_files/nn_schema.py).

Image Dimensions

W_downsized: int = 160

Width of captured game frames (pixels)

Game screenshots are resized to this width before being fed to the neural network.

  • Lower values: Faster training, less memory, reduced visual detail

  • Higher values: Better visual quality, slower training, more memory

  • Typical range: 128-256 pixels

  • Current: 160 pixels provides good balance

Note

The CNN output dimension is automatically calculated from these dimensions when the network is created. No manual configuration needed.

H_downsized: int = 120

Height of captured game frames (pixels)

Game screenshots are resized to this height before being fed to the neural network.

  • Typical range: 96-192 pixels

  • Current: 120 pixels (4:3 aspect ratio with W=160)


Input Dimensions

float_input_dim: int = 191

Total dimension of scalar (non-image) inputs (computed)

Size of feature vector fed to network alongside images.

Breakdown with default settings:

  • 1: Time remaining in the mini-race (placeholder filled at batch time)

  • 20: Previous action encodings (4 binary flags × n_prev_actions_in_inputs = 4 × 5)

  • 120: Waypoint X,Y,Z coordinates (3 × n_zone_centers_in_inputs = 3 × 40)

  • 16: Wheel contact materials (4 wheels × n_contact_material_physics_behavior_types = 4 × 4)

  • 1: Freewheeling flag

  • 4: Car gear information

  • 2: Speed-related features

  • Remaining floats: other car state (wheel state, angular velocity, velocity, map-relative position, distance to finish; see the feature order in State Normalization below)

The exact total is computed from the config at load time; changing the waypoint, action-history or contact-material settings changes it accordingly.

For a detailed mapping of each segment to game data (SimStateData) and a list of game fields we do not use, see Game inputs and float observation vector.

State Normalization (float_inputs_mean / float_inputs_std)

Defined in the state_normalization section of the config YAML (or built from defaults in the loader).

Scalar inputs are normalized before the network: (float_inputs - float_inputs_mean) / float_inputs_std. This keeps activations in a reasonable range and can speed up training.

How were these values obtained?

The repo does not include a script that computes them from data. The current values are a mix of:

  1. Domain-derived: From config or known ranges. Examples:

    - First feature (mini-race time): mean = temporal_mini_race_duration_actions / 2 (70), std = 70.

    - Distance to finish: mean = margin_to_announce_finish_meters (700), std = 350.

  2. Typical for binary/bounded inputs: For action flags (0/1) and wheel/gear floats, mean and std are set to plausible “typical” values (e.g. 0.5 for symmetric binary, 0.8/0.2 for “often accelerate”).

  3. Possibly empirical: The 40 waypoint coordinates (120 floats) have non-round means and stds (e.g. -2.1, 9.5, 19.1, 28.5…), which suggests they may have been computed as sample mean and std over rollouts on one or more maps in the past, then hardcoded.

How to recompute from your own data:

  1. Collect many state_float vectors (same order as in game_instance_manager.py: placeholder, previous_actions, gear/wheels, angular velocity, velocity, y_map, zone_centers, distance_to_finish, freewheeling). During training, the first element is overwritten with mini-race time in buffer_collate_function, so for normalization you can either use the “raw” state from the game or the post-collate state from the buffer.

  2. Stack into a matrix of shape (N, float_input_dim).

  3. Compute mean = np.nanmean(data, axis=0) and std = np.nanstd(data, axis=0) (or replace zeros in std with a small constant to avoid division by zero).

  4. Update float_inputs_mean and float_inputs_std in the state_normalization section of your config YAML.
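
A minimal sketch of steps 2-3, assuming collected_states already holds raw state_float vectors (the sample data below is fake and only for illustration):

import numpy as np

# collected_states: list of raw state_float vectors, each of length float_input_dim
collected_states = [np.random.randn(191).tolist() for _ in range(1000)]

data = np.stack(collected_states).astype(np.float64)   # shape (N, float_input_dim)
float_inputs_mean = np.nanmean(data, axis=0)
float_inputs_std = np.nanstd(data, axis=0)
float_inputs_std[float_inputs_std < 1e-6] = 1.0        # guard constant features against division by zero

# Paste these two vectors into the state_normalization section of your config.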

If you change the number or order of float features (e.g. waypoints, actions), the length and indices in the config’s state normalization must match float_input_dim and the order in game_instance_manager.py / buffer_utilities.py.

Network Architecture

Note

Authoritative YAML paths for widths and vision are under nn: (Neural network YAML (nn) — full reference). The entries below describe the flat names exposed on get_config() for backward compatibility.

float_hidden_dim: int = 256

Hidden layer size for scalar feature processing

Transforms float_input_dim features before merging with visual features.

  • Lower values (128-192): Faster, less capacity

  • Higher values (256-512): More capacity, slower

  • Typical range: 128-512

dense_hidden_dimension: int = 1024

Main hidden layer size

Primary representation layer after combining visual and scalar features.

  • Lower values (512-768): Faster training, less capacity

  • Higher values (1024-2048): More model capacity, slower

  • Typical range: 512-2048

  • Current: 1024 provides strong capacity for complex tracks

IQN Parameters

Implicit Quantile Networks (IQN) model the full return distribution rather than just expected value.

See: Dabney et al. 2018 - Implicit Quantile Networks for Distributional RL

iqn_embedding_dimension: int = 64

Quantile embedding dimension

Controls how finely the return distribution is modeled.

  • Higher values: More expressive distribution modeling

  • Typical range: 32-128

  • Paper recommendation: 64

iqn_n: int = 8

Number of quantile samples during training

How many quantile samples to draw when computing training loss.

  • Must be even (sampled symmetrically around 0.5)

  • Higher values: More stable gradients, slower training

  • Typical range: 8-32

  • Paper recommendation: 8 for training

iqn_k: int = 32

Number of quantile samples during inference

How many quantile samples for action selection during rollouts.

  • Must be even (sampled symmetrically around 0.5)

  • Higher values: Better action selection, slower inference

  • Typical range: 8-64

  • Paper recommendation: 32 for evaluation

iqn_kappa: float = 0.005

Huber loss threshold

Transition point between L1 and L2 loss in quantile Huber loss.

  • Lower values: More robust to outliers (L1-like)

  • Higher values: More sensitive to all errors (L2-like)

  • Typical range: 1e-3 to 1.0

  • Paper default: 1.0

  • Current: 5e-3 works better empirically

Q-learning variant

use_ddqn: bool = False

Use Double DQN for target computation

Switches how the TD target for the next state is computed in the IQN training loop (see trackmania_rl/agents/iqn.py, train_on_batch).

When False (standard DQN-style):

  • Target: reward + gamma * max_a Q_target(next_state, a)

  • The target network both selects the best action and evaluates it, which can lead to overestimation of Q-values.

When True (Double DQN):

  • The online network selects the action: a* = argmax_a Q_online(next_state, a) (after averaging over quantiles).

  • The target network evaluates that action: target = reward + gamma * Q_target(next_state, a*).

  • Selecting and evaluating are decoupled, which reduces overestimation and often stabilizes training.

Effect:

  • True: Usually more stable learning, less Q overestimation; one extra forward pass through the online network per batch.

  • False: Slightly faster per batch; may overestimate Q more.

See: van Hasselt et al. 2016 - Deep Reinforcement Learning with Double Q-learning
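
A schematic comparison of the two targets in plain Q-value form (the actual IQN code in trackmania_rl/agents/iqn.py works on quantile samples; these helpers are illustrative only):

import torch

# rewards, gamma, not_done: (batch,) tensors; q_*_next: (batch, n_actions) action values for the next state

def dqn_target(rewards, gamma, not_done, q_target_next):
    # The target network both selects and evaluates the next action.
    return rewards + gamma * not_done * q_target_next.max(dim=1).values

def double_dqn_target(rewards, gamma, not_done, q_online_next, q_target_next):
    # The online network selects the action, the target network evaluates it.
    best_actions = q_online_next.argmax(dim=1, keepdim=True)
    return rewards + gamma * not_done * q_target_next.gather(1, best_actions).squeeze(1)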

Gradient Clipping

clip_grad_value: float = 1000

Maximum absolute gradient value

Clips individual gradient elements to [-value, +value].

  • Purpose: Prevent exploding gradients

  • Current: 1000 (effectively disabled)

  • Typical range: 10-1000

clip_grad_norm: float = 30

Maximum gradient L2 norm

Scales gradient if its L2 norm exceeds this value.

  • Purpose: Prevent exploding gradients by total magnitude

  • Lower values: More aggressive clipping, more stable

  • Higher values: Less clipping, faster learning, less stable

  • Typical range: 10-100

Target Network

number_memories_trained_on_between_target_network_updates: int = 2048

Target network update frequency (in transitions)

How often to update target network from online network.

  • Purpose: Stable Q-value targets during training

  • Higher values: More stable but slower improvement propagation

  • Current: 2048 transitions ≈ 4 batches (with batch_size=512)

  • Typical range: 1000-10000

soft_update_tau: float = 0.02

Soft update coefficient (tau)

Interpolation factor: target = tau × online + (1 - tau) × target

  • tau=1.0: Hard update (full copy)

  • tau=0.0: No update

  • Lower values: Smoother, more stable updates

  • Current: 0.02 is relatively aggressive

  • Typical range: 0.001-0.1
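
The interpolation rule above in a few lines of PyTorch (a generic sketch, not the project's exact update code):

import torch

@torch.no_grad()
def soft_update(target: torch.nn.Module, online: torch.nn.Module, tau: float = 0.02) -> None:
    """target = tau * online + (1 - tau) * target, parameter by parameter."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.lerp_(o_param, tau)   # in-place linear interpolation towards the online weights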

Training Configuration

Located in the training section of the config YAML.

Run Identification

run_name: str = "uni_3"

Experiment identifier

Used for:

  • Tensorboard log directory naming

  • Model checkpoint naming

  • Distinguishing multiple experiments

Example: "uni_3", "A02_training", "experiment_v2"

algorithm: str = "iqn"

Which RL algorithm / wiring to run

  • iqn — Off-policy IQN (default): replay buffer, target network, two weight files. Exploration uses exploration.* schedules.

  • ppo — On-policy PPO with actor-critic: no replay in the learner, no weights2.torch. Stochastic policy from the network; collectors skip IQN ε schedules. PPO-specific loss hyperparameters under ppo:; discount \(\gamma\) for rollout reward shaping and GAE is configured under training.policy_rollout_gamma / training.policy_rollout_gamma_schedule (preferred) or legacy ppo.gamma / ppo.ppo_gamma_schedule. Network routing: nn.fusion_mode + nn.vis + nn.float + nn.encoder (see Neural network YAML (nn) — full reference). See PPO configuration (ppo:). Diagrams: PPO actor-critic architecture.

  • dpo — Direct preference optimization: same actor-critic stack and collector contract as PPO (stochastic policy, weights1.torch only). The learner forms preference pairs from an online buffer and/or offline JSONL (see DPO configuration (dpo:)). Hyperparameters under dpo: with a dpo_* prefix. Per-step rollout rewards use the same training.policy_rollout_gamma* builder as PPO (no PPO loss). Network / collector diagrams (shared with PPO): PPO actor-critic architecture.

  • grpo — Group relative policy optimization: same wiring as PPO; the learner waits for grpo_group_size rollouts, centers returns within each group as advantages, and optimizes the policy (optional KL to a periodic reference). Hyperparameters under grpo: with grpo_* prefix (see GRPO configuration (grpo:)). Rollout rewards use training.policy_rollout_gamma* (no PPO loss). Diagrams: GRPO: network and training (training); actor-critic topology in PPO actor-critic architecture.

PPO configuration (ppo:)

The PPO clipped objective runs only when training.algorithm: ppo. GAE (with \(\lambda\) from ppo:) runs only for PPO; discount \(\gamma\) for GAE is the same scalar/schedule as for rollout reward shaping: training.policy_rollout_gamma / training.policy_rollout_gamma_schedule when set, else legacy ppo.gamma / ppo.ppo_gamma_schedule. When algorithm is dpo or grpo, only the shared rollout builder uses that \(\gamma\) (no GAE, no PPO clip). The IQN learner and replay buffer are not used for ppo, dpo, or grpo; memory.* / n-step / priority settings have no effect on those paths.

For actor-critic topology (CNN, optional HF backbone, multimodal fusion), see PPO actor-critic architecture. For the GRPO learner (grouped rollouts, advantages, loss), see GRPO: network and training.

Reference YAML: config_files/rl/config_ppo.yaml (general). Narrower baselines: config_ppo_cnn_mlp.yaml (CNN + float MLP only, nn.fusion_mode: none), config_ppo_transformer.yaml (post_concat + HF vision + HF fusion encoder), config_ppo_post_concat_cnn_tf.yaml (post_concat + CNN vision + native fusion transformer). Those files trim IQN-only keys from training: (e.g. no batch_size, gamma_schedule, replay/pretrain paths). The shared nn.decoder / nn.iqn blocks remain in the schema for parity with IQN configs but are unused by the PPO forward path.

Flat access: keys under ppo: are exposed on get_config() without a .ppo prefix (e.g. cfg.clip_coef reads ppo.clip_coef). Rollout \(\gamma\) flat names: cfg.policy_rollout_gamma, cfg.policy_rollout_gamma_schedule (from training:).

Checkpoints

weights1.torch (policy), optimizer1.torch, and scaler.torch are written on each periodic save (see save_ppo_checkpoint in trackmania_rl/utilities.py). When a HF vision backbone or native fusion transformer is enabled, hf_transformer_vis/ and/or hf_transformer_fusion/ are also updated in transformers layout. There is no target network file.

Fields under ``ppo:`` (class PPOConfig in config_files/config_schema.py — all are read by the PPO learner / loss)

rollout_steps_per_update (int)

Total number of environment steps the learner waits for (summed over all rollout queues) before running one PPO update. Collectors push variable-length rollouts; the learner concatenates them until at least this many steps are available, then trains and clears the buffer.

gamma (float)

Legacy fallback discount for on-policy rollout reward shaping and PPO GAE when training.policy_rollout_gamma / training.policy_rollout_gamma_schedule are unset. Prefer training.policy_rollout_gamma* for clarity (see scheduled_rollout_shaping_gamma in trackmania_rl/multiprocess/policy_rollout_batch.py).

gae_lambda (float)

GAE \(\lambda\) for advantage estimation (bias/variance tradeoff).

clip_coef (float)

PPO surrogate clip range \(\varepsilon\) (ratio \(\pi/\pi_{\mathrm{old}}\) is clipped to \([1-\varepsilon, 1+\varepsilon]\)).

vf_coef (float)

Coefficient for the value-function (critic) loss term in the combined PPO objective.

ent_coef (float)

Coefficient for the policy entropy bonus (encourages exploration via stochastic logits; there is no separate IQN-style ε-greedy on PPO — exploration comes from sampling + this term).

max_grad_norm (float)

Max norm for gradient clipping on the policy parameters after each optimizer minibatch.

update_epochs (int)

After each environment data collection phase, the learner has a fixed set of transitions (length T timesteps — see below). It runs the PPO loss this many times over that same data. Each epoch:

  • builds a random permutation of all T indices (full shuffle, not sequential),

  • steps through that order in chunks of size mb (see num_minibatches).

So one PPO update performs about update_epochs × ceil(T / mb) optimizer steps on the same rollout batch (with different shuffles each epoch). Higher update_epochs → stronger use of each collected step (better sample efficiency per environment step) but a larger risk of overfitting that batch and overshooting the trust region PPO assumes (policy can move too far from π_old). Typical values are small (e.g. 3–10).

num_minibatches (int)

Controls minibatch size for SGD inside one epoch. The learner sets mb = max(1, T // num_minibatches) (integer division) and, for each epoch, slices the shuffled indices with stride mb. There is no YAML key named batch_steps: the batch length is T, the number of timesteps in the concatenated rollout tensor after the learner has accumulated at least rollout_steps_per_update steps (T is usually that threshold and can be a bit larger depending on how rollouts arrive).

Effect on training: Larger num_minibatches (for fixed T) → smaller mb → more, noisier gradient steps per epoch (more like SGD), lower peak GPU memory per step. Smaller num_minibatches → larger mb → fewer, smoother updates per epoch. If T is not divisible by num_minibatches, the last chunk in an epoch can be shorter; you can occasionally get one extra partial minibatch.
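
A schematic of the epoch/minibatch loop described above (an illustrative index generator, not the project's learner):

import numpy as np

def ppo_minibatch_indices(T, update_epochs, num_minibatches, seed=0):
    """Yield index chunks: one full shuffle per epoch, chunks of size mb = max(1, T // num_minibatches)."""
    rng = np.random.default_rng(seed)
    mb = max(1, T // num_minibatches)
    for _ in range(update_epochs):
        order = rng.permutation(T)             # new random permutation of all T timesteps each epoch
        for start in range(0, T, mb):
            yield order[start:start + mb]      # last chunk may be shorter if T % mb != 0

# e.g. T=4096 collected steps, 4 epochs, 8 minibatches -> 32 optimizer steps per PPO update
print(sum(1 for _ in ppo_minibatch_indices(T=4096, update_epochs=4, num_minibatches=8)))  # 32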

normalize_advantages (bool)

If true, advantages are normalized to zero mean and unit variance within each PPO update before the policy loss.

Optional piecewise-linear schedules (under ``ppo:``)

Same list shape as training.lr_schedule: [[frame0, value0], [frame1, value1], ...] with cumulative environment frames on the first element. Values are linearly interpolated between knots (unlike lr_schedule, which uses exponential interpolation in code).

Frame counts in these lists are multiplied by training.global_schedule_speed at load time (same as lr_schedule).

If a schedule key is omitted, the scalar above is used for all time (equivalent to a single knot [[0, scalar]]).

  • ppo_gamma_schedule — legacy piecewise \(\gamma\) for rollout shaping + GAE when training.policy_rollout_gamma_schedule is omitted. Prefer training.policy_rollout_gamma_schedule.

  • gae_lambda_schedule — GAE \(\lambda\) per frame index.

  • ent_coef_schedule — entropy bonus coefficient in the PPO loss.

  • vf_coef_schedule — value-loss coefficient.

Flat access: get_config().gae_lambda_schedule, get_config().ent_coef_schedule, etc. (YAML under ppo:). Rollout \(\gamma\) schedules: get_config().policy_rollout_gamma_schedule (training:).

``nn.vis`` (image branch)

image_size (width / height; field aliases w / h). Flat get_config().w_downsized / h_downsized read from vis.image_size.

Exactly one image encoder: either cnn: { ... } or transformer: { ... } (TransformersConfig — ViT/patch/HF). no_image: true means float-only.

freeze: true under vis freezes the vision stem during RL; see RL parameter freeze.

``nn.fusion_mode`` / ``nn.init_from_pretrained`` (global PPO routing)

Top-level under nn: (not under encoder). fusion_mode: none | vision_transformer | post_concat | unified (see PPO actor-critic architecture). init_from_pretrained is an optional Rulka save_pretrained dir when fusion_mode != none. YAML that still lists these keys under nn.encoder is accepted and hoisted to nn at load. Flat get_config().fusion_mode / transformers / fusion still work via ConfigView.

``nn.encoder`` (fusion trunk stack only)

Optional mlp (float MLP override for fusion policies; else nn.float.mlp) and transformer (stack TransformersConfig for the multimodal encoder). Scalar float width is nn.float.mlp.hidden_dim. IQN decoder.advantage / decoder.value are either an mlp block (hidden_dim / alias hidden, layers / alias n_hidden_layers) or a native transformer stack (TransformerStackConfig; use_hf_backbone is reserved / rejected until wired).

  • Wiring: fusion_mode != none → fusion module; fusion_mode == none and vis.transformer.use_hf_backbone → HfActorCritic; else CNN trunk from vis.cnn.

  • unified requires vis.transformer.d_model == encoder.transformer.d_model.

IQN ignores this block except that the nested object exists in the schema.

Periodic PPO saves: hf_transformer_vis/ when fusion_mode == none and vis.transformer.use_hf_backbone; hf_transformer_fusion/ when fusion_mode != none.

Install when using HF:

pip install -e ".[policy]"

Keep nn.training.use_jit: false for standalone HF unless you have validated torch.compile for that backbone.

Pretrain / IQN-only scripts

Encoder injection and BC-to-IQN helpers under scripts/ apply to IQN. With algorithm in ppo, dpo, or grpo, train.py warns if IQN-only pretrain paths are set.

Freezing after loading pretrained weights is configured under nn (vis.freeze, decoder.advantage.freeze, …), not under top-level training:; see RL parameter freeze.

DPO configuration (dpo:)

Used when training.algorithm: dpo. Same policy network and checkpoint layout as PPO (see PPO actor-critic architecture for Variants A/B/C, process stack, and rollout-queue payload shared with DPO). A frozen reference policy \(\pi_{\mathrm{ref}}\) is kept in memory; the DPO loss uses trajectory-level log-probability gaps \(\log\pi_\theta - \log\pi_{\mathrm{ref}}\) for the chosen vs rejected trajectory. Class DPOConfig lives in config_files/config_schema.py.

Flat access: keys under dpo: are exposed on get_config() with the dpo_* prefix (e.g. cfg.dpo_beta), so they do not collide with ppo: scalars.

Data

  • dpo_data_mode — online | offline | both. Online: each valid on-policy rollout (tensorized like PPO) may be appended to a FIFO buffer of size dpo_pair_buffer_max. When the buffer has at least two entries, the learner picks the rollouts with the largest and smallest scalar scores, where the score is the sum of per-step shaped rewards in that rollout’s tensor batch (same rewards used in the learner). Those two trajectories become win/lose for one optimization round, then both are removed from the buffer. If the best and worst index coincide (e.g. all scores equal), no pair is formed until more rollouts arrive. Offline: pairs are read from dpo_offline_pairs_jsonl. Both: uses an offline pair when len(pair_buffer) < 2 or when update_count % 2 == 1 (odd pair-update index), else tries online pairing from the buffer. Rollouts need not be full track finishes; length is whatever the environment/collector produced, subject to the usual minimum steps for valid tensors.

  • dpo_pair_buffer_max — maximum online rollouts retained in the FIFO buffer for pairing (each slot is one full episode rollout, once successfully converted to policy tensors).

  • dpo_offline_pairs_jsonl — path to JSONL; each line is {"chosen": "<path>", "rejected": "<path>"} where each path loads a joblib tuple (rollout_results, end_race_stats) (same objects as the multiprocess queue). Append lines with scripts/dpo_append_offline_pair.py.
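
A toy sketch of the online pairing rule described above (the buffer entries and the step_rewards field here are illustrative; the real learner scores full rollout tensor batches):

def pick_online_pair(pair_buffer):
    """Return (chosen, rejected) rollouts by best/worst score, or None if no valid pair can be formed."""
    if len(pair_buffer) < 2:
        return None
    scores = [sum(rollout["step_rewards"]) for rollout in pair_buffer]   # score = sum of shaped rewards
    best, worst = scores.index(max(scores)), scores.index(min(scores))
    if best == worst:                                  # e.g. all scores equal: wait for more rollouts
        return None
    chosen, rejected = pair_buffer[best], pair_buffer[worst]
    for idx in sorted({best, worst}, reverse=True):    # both trajectories leave the FIFO buffer
        pair_buffer.pop(idx)
    return chosen, rejected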

Loss / optimization

  • dpo_beta — temperature \(\beta\) in the DPO preference loss. Optional dpo_beta_schedule — piecewise-linear vs cumulative env frames (same axis as training.lr_schedule; scaled by training.global_schedule_speed). If omitted, the scalar applies at all times (implicit [0, scalar]).

  • dpo_vf_coef — weight on auxiliary MSE value-head regression to the values stored at collection time (old_values in the rollout tensors). Optional dpo_vf_coef_schedule (same schedule semantics as dpo_beta_schedule).

  • dpo_update_epochs — optimizer steps per trajectory pair: the learner runs this many forward/backward passes on the same win/lose pair before moving on (each pass uses the current policy weights).

  • dpo_max_grad_norm — argument to torch.nn.utils.clip_grad_norm_ after each backward within that inner loop. Optional dpo_max_grad_norm_schedule (same linear frame axis).

  • dpo_num_minibatches — reserved for future within-pair minibatching; the current DPO learner does not use it (mirrors PPO/GRPO naming).

Reference policy

  • dpo_ref_sync_every_updates — reference sync period in pair updates: after each successful optimization on one trajectory pair (after all dpo_update_epochs steps for that pair), the learner increments a counter; every N-th such increment copies the live training weights into ref_policy (load_state_dict from uncompiled_local), where N is this setting. Smaller N keeps \(\pi_{\mathrm{ref}}\) closer to \(\pi_\theta\); larger N uses a slower-moving reference.

Reference YAML: config_files/rl/config_dpo.yaml. Rollout per-step rewards: training.policy_rollout_gamma* (see policy_rollout_gamma above).

GRPO configuration (grpo:)

Used when training.algorithm: grpo. Same network and checkpoints as PPO. The learner waits until it has grpo_group_size successful tensorized rollouts, forms group-relative advantages from each trajectory’s total return, and applies a policy-gradient-style objective plus an entropy bonus. Class GRPOConfig lives in config_files/config_schema.py. Narrative diagrams and comparison to PPO: GRPO: network and training.

Flat access: grpo_* on get_config() (e.g. cfg.grpo_group_size).

Group / returns

  • grpo_group_size — number of full rollouts aggregated before one learner update (all must yield valid on-policy tensors).

  • grpo_normalize_group — mean subtracts the group mean return from each trajectory; mean_std additionally divides by the group standard deviation (with a small stabilizer in code).
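
The group-relative advantage in a few lines (an illustrative sketch of the two grpo_normalize_group modes, using a small eps in place of the in-code stabilizer):

import numpy as np

def group_advantages(returns, normalize="mean_std", eps=1e-8):
    """Center each trajectory's total return within the group; optionally divide by the group std."""
    returns = np.asarray(returns, dtype=np.float64)
    advantages = returns - returns.mean()
    if normalize == "mean_std":
        advantages = advantages / (returns.std() + eps)
    return advantages

print(group_advantages([120.0, 80.0, 100.0, 140.0]))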

Loss / optimization

  • grpo_ent_coef — entropy bonus coefficient in the GRPO loss. Optional grpo_ent_coef_schedule — piecewise-linear vs cumulative env frames (same axis as training.lr_schedule; scaled by training.global_schedule_speed). If the schedule is omitted, the scalar applies at all times (implicit [0, scalar]).

  • grpo_max_grad_norm — argument to torch.nn.utils.clip_grad_norm_ after each backward. Optional grpo_max_grad_norm_schedule (same schedule semantics).

  • grpo_update_epochs — passes over the grouped batch per update.

  • grpo_num_minibatches — reserved for future within-group minibatching; the current GRPO learner does not use it (mirrors PPO/DPO naming).

Reference policy / KL

  • grpo_ref_kl_coef — weight on the mean \(\log \pi(a|s) - \log \pi_{\mathrm{ref}}(a|s)\) term vs frozen ref_policy. Set to 0 to disable the KL term entirely (no reference forward passes in the loss).

  • grpo_ref_sync_every_updates — reference sync period in group updates: after each group optimization, the learner increments a counter and every N-th update copies the live policy into ref_policy (load_state_dict), where N is this setting. Smaller N keeps the KL reference fresher; larger N uses a slower-moving reference. When grpo_ref_kl_coef is 0, the KL term is disabled; syncs may still run but do not affect the loss.

  • Optional grpo_ref_kl_coef_schedule — same linear frame axis as the other grpo_*_schedule fields.

Reference YAML: config_files/rl/config_grpo.yaml. Rollout per-step rewards: training.policy_rollout_gamma*.

Schedules

All schedules are lists of (cumulative_frames, value) tuples. Values are linearly interpolated between schedule points unless noted otherwise (training.lr_schedule uses exponential interpolation in from_exponential_schedule).

Piecewise-linear schedules: training.policy_rollout_gamma_schedule (on-policy rollout shaping \(\gamma\), all of PPO/DPO/GRPO) and, under ppo:, gae_lambda_schedule, ent_coef_schedule, vf_coef_schedule use linear interpolation vs frames; see PPO configuration (ppo:). Under dpo:, dpo_beta_schedule, dpo_vf_coef_schedule, and dpo_max_grad_norm_schedule use the same linear rule; see DPO configuration (dpo:). Under grpo:, grpo_ent_coef_schedule, grpo_max_grad_norm_schedule, and grpo_ref_kl_coef_schedule use the same linear rule; see GRPO configuration (grpo:). Legacy ppo.ppo_gamma_schedule applies if training.policy_rollout_gamma_schedule is unset.
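
A sketch of the piecewise-linear lookup shared by these schedules (np.interp over cumulative frames, with global_schedule_speed applied to the frame axis; the helper name is illustrative):

import numpy as np

def schedule_value(schedule, frames_played, global_schedule_speed=1.0):
    """Linearly interpolate a [[frame, value], ...] schedule at the current cumulative frame count."""
    knots = sorted(schedule)
    frames = [frame * global_schedule_speed for frame, _ in knots]   # frame axis stretched by the multiplier
    values = [value for _, value in knots]
    return float(np.interp(frames_played, frames, values))           # clamps to the first/last value outside the knots

ent_coef_schedule = [[0, 0.01], [5_000_000, 0.001]]
print(schedule_value(ent_coef_schedule, 2_500_000))  # 0.0055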

global_schedule_speed: float = 1.0

Global schedule multiplier

Multiplies all frame counts in schedules by this value. Larger value = longer training (schedules stretch); smaller value = shorter training (schedules compress).

  • 1.0: Normal speed

  • >1.0: Slower schedules (e.g. 4 → each step at 4× more frames)

  • <1.0: Faster schedules (e.g. 0.8 → each step at 0.8× frames)

Useful for adjusting training duration without editing all schedules.

Also scaled (first coordinate of each [frame, value] pair): training.policy_rollout_gamma_schedule, ppo.ppo_gamma_schedule (legacy), ppo.gae_lambda_schedule, ppo.ent_coef_schedule, ppo.vf_coef_schedule (see PPO configuration (ppo:)), under dpo:dpo_beta_schedule, dpo_vf_coef_schedule, dpo_max_grad_norm_schedule (see DPO configuration (dpo:)), and under grpo:grpo_ent_coef_schedule, grpo_max_grad_norm_schedule, grpo_ref_kl_coef_schedule (see GRPO configuration (grpo:)).

Optimizer

adam_epsilon: float = 0.0001

Adam epsilon for numerical stability

Small constant added to denominator in Adam’s adaptive learning rate.

  • Paper default: 1e-8

  • Current: 1e-4 helps with stability

  • Typical range: 1e-8 to 1e-3

adam_beta1: float = 0.9

Adam beta1 (first moment decay)

Controls gradient direction smoothing.

  • Paper default: 0.9 (recommended)

  • Typical range: 0.9-0.99

adam_beta2: float = 0.999

Adam beta2 (second moment decay)

Controls gradient magnitude smoothing.

  • Paper default: 0.999 (recommended)

  • Typical range: 0.99-0.999

batch_size: int = 512

Training batch size

Number of transitions sampled from replay buffer per training step.

Larger batches:

  • ✅ More stable gradient estimates

  • ✅ Better GPU utilization

  • ❌ More memory usage

  • ❌ Less frequent weight updates per frame

  • Typical range: 32-512

  • Current: 512 is aggressive but works with large buffers

lr_schedule: list = [(0, 0.001), (3000000, 5e-05), (12000000, 5e-05), (15000000, 1e-05)]

Learning rate schedule

Controls gradient descent step size throughout training.

Strategy:

  • Start high (1e-3) for rapid initial learning

  • Decay to 5e-5 at 3M frames for stable convergence

  • Maintain until 12M frames

  • Final decay to 1e-5 at 15M frames for fine-tuning

Rationale: Large steps for exploration, small steps for exploitation.

Why a separate schedule is needed:

Even though the RAdam optimizer is used (which has built-in variance reduction through rectification), a separate learning rate schedule is still necessary for the following reasons:

  • RAdam does not have built-in warmup: While RAdam’s rectification mechanism reduces variance in early training stages, it does not provide explicit learning rate warmup or decay functionality.

  • Decay is essential: The schedule provides explicit learning rate decay throughout training, which is crucial for stable convergence and fine-tuning. The exponential decay between schedule points allows smooth transitions.

  • Frame-based updates: Unlike standard PyTorch schedulers (which update based on optimizer steps via scheduler.step()), this schedule updates based on cumul_number_frames_played, which better matches the RL training paradigm where learning rate should be tied to environment interactions rather than optimization steps.

  • Custom interpolation: The implementation uses exponential interpolation between schedule points, providing smoother decay than step-based schedulers.
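A sketch of geometric (log-space) interpolation between knots, which is what “exponential interpolation” amounts to; the project's from_exponential_schedule may differ in details:

import math

def value_from_exponential_schedule(schedule, frames):
    if frames <= schedule[0][0]:
        return schedule[0][1]
    for (f0, v0), (f1, v1) in zip(schedule, schedule[1:]):
        if frames <= f1:
            t = (frames - f0) / (f1 - f0)
            # Linear interpolation in log-space -> smooth exponential decay in value.
            return math.exp(math.log(v0) + t * (math.log(v1) - math.log(v0)))
    return schedule[-1][1]

# Halfway between (0, 1e-3) and (3_000_000, 5e-5) this gives sqrt(1e-3 * 5e-5) ≈ 2.2e-4,
# noticeably lower than the linear midpoint of ≈ 5.3e-4.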

gamma_schedule: list = [(0, 0.999), (1500000, 0.999), (2500000, 1)]

Discount factor (gamma) schedule

How much to value future rewards vs immediate rewards.

  • gamma → 1.0: Value long-term consequences (far-sighted)

  • gamma → 0.0: Value immediate rewards (myopic)

Strategy:

  • Start at 0.999 (slight discounting)

  • Transition to 1.0 by 2.5M frames (undiscounted)

Rationale: TrackMania benefits from long-term planning. Gamma=1.0 treats all future rewards equally.

On-policy algorithms: rollout reward shaping \(\gamma\) uses training.policy_rollout_gamma_schedule (preferred) or legacy ppo.ppo_gamma_schedule (see PPO configuration (ppo:)). IQN only: this training.gamma_schedule is the n-step discount, which has different semantics; do not mix it with the policy-rollout \(\gamma\).

policy_rollout_gamma: float | None = None

On-policy rollout reward shaping (PPO / DPO / GRPO)

Scalar \(\gamma\) used in potential-based folding when building per-step rewards from env rollouts; it applies when both policy_rollout_gamma_schedule and ppo.ppo_gamma_schedule are unset. If this field is null/omitted, the code falls back to ppo.gamma. When using a schedule, set this explicitly (e.g. 0.99) as the step-0 anchor alongside policy_rollout_gamma_schedule.

policy_rollout_gamma_schedule: list | None = None

Scheduled \(\gamma\) for on-policy rollout tensors

Same [[frame, value], ...] shape as lr_schedule knot list; linear interpolation; frame counts multiplied by global_schedule_speed at load. Takes precedence over ppo.ppo_gamma_schedule. Used by build_policy_rollout_tensors and by PPO GAE (same \(\gamma\) as shaping).
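The precedence described above can be summarized with a small hypothetical helper; the attribute names mirror the YAML keys:

def resolve_rollout_gamma_source(cfg):
    if getattr(cfg, "policy_rollout_gamma_schedule", None):
        return "training.policy_rollout_gamma_schedule"   # scheduled, preferred
    if getattr(cfg, "ppo_gamma_schedule", None):
        return "ppo.ppo_gamma_schedule"                    # legacy schedule
    if getattr(cfg, "policy_rollout_gamma", None) is not None:
        return "training.policy_rollout_gamma"             # scalar anchor
    return "ppo.gamma"                                      # final fallback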

N-Step Learning

n_steps: int = 3

Number of steps for n-step returns

How many steps ahead to bootstrap Q-value estimate.

  • n=1: TD(0) - high bias, low variance

  • n>1: Multi-step - lower bias, higher variance

  • n=∞: Monte Carlo - no bias, maximum variance

Higher n:

  • ✅ Faster credit assignment

  • ✅ Lower bias

  • ❌ Higher variance

  • Typical range: 1-5

  • Current: 3 balances credit assignment and variance

discard_non_greedy_actions_in_nsteps: bool = True

Exclude exploratory actions from n-step returns

  • True: Only greedy actions in n-step backup (recommended)

  • False: Include all actions

Reduces exploration bias with epsilon-greedy.
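A minimal illustration of the n-step target (generic names, not the project's buffer code):

def n_step_target(rewards, q_bootstrap, gamma, n):
    # r_0 + gamma*r_1 + ... + gamma^(n-1)*r_{n-1} + gamma^n * Q(s_n, a*)
    target = 0.0
    for k in range(n):
        target += (gamma ** k) * rewards[k]
    return target + (gamma ** n) * q_bootstrap

# n = 1 reduces to the TD(0) target r_0 + gamma * Q(s_1, a*). With
# discard_non_greedy_actions_in_nsteps = True the backup is truncated earlier
# whenever an exploratory action falls inside the n-step window.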

Temporal Training Parameters (Mini-Races)

These parameters bias how the current time in the mini-race is sampled when forming batches in buffer_collate_function, and (when PER is enabled) which transitions get their priority updated.

What is “current time” in the mini-race?

The mini-race is always 7 seconds (140 actions). For each transition in a batch we randomly choose where we are within that window: a number from 0 to about 139 (in actions). That is the “current time” in the mini-race: 0 = start of the window, 139 = near the end. This number drives how many steps are left until the “finish” of the mini-race and whether the transition is terminal.

How is this time computed (step by step)?

The code does not draw uniformly from [0, 140). It does:

  1. Extended range: Draw an integer from [low, high) with low = -oversample_long_term_steps + oversample_maximum_term_steps (= -35), high = temporal_mini_race_duration_actions + oversample_maximum_term_steps (= 145). So we draw from [-35, 145) — including negatives and slightly above 140. That is the “extended” range: wider than the “honest” 0..139.

  2. abs(): Take the absolute value. Negatives -35..-1 become 35..1. So values 1, 2, …, 35 now each appear twice (once from a positive draw, once from a negative), while 0, 36, 37, …, 144 appear once. So the probability of values 1–35 is about twice the probability of the rest.

  3. Shift (-5): Subtract oversample_maximum_term_steps (= 5) from the result. The range shifts left; some values become negative.

  4. clip(min=0): Replace all negatives by 0. The final “current time” is in [0, 139].

So: the final time is still inside the 7-second window (0..139 actions). But the distribution is not uniform: values roughly 0–30 (start of the mini-race, “many steps left”) are more likely — that is the oversampling of the “long horizon”.
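The same four steps, written as a small NumPy sketch (parameter names mirror the config fields; this follows the description above, not the exact buffer code):

import numpy as np

def sample_current_time(batch_size,
                        temporal_mini_race_duration_actions=140,
                        oversample_long_term_steps=40,
                        oversample_maximum_term_steps=5):
    low = -oversample_long_term_steps + oversample_maximum_term_steps            # -35
    high = temporal_mini_race_duration_actions + oversample_maximum_term_steps   # 145
    draw = np.random.randint(low, high, size=batch_size)  # 1. extended range [-35, 145)
    draw = np.abs(draw)                                    # 2. fold negatives (doubles 1..35)
    draw = draw - oversample_maximum_term_steps            # 3. shift left by 5
    return np.clip(draw, 0, None)                          # 4. clamp -> final time in [0, 139]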

oversample_long_term_steps: int = 40

Oversample “long horizon” (many steps left in the mini-race)

The larger this number, the wider the early part of the mini-race (times 0, 1, 2, …) that gets higher probability from the abs() step. With 40, the negative part of the extended range is -35..-1; after abs(), the probability of values 1..35 is doubled. So transitions with “many steps left” in the 7-second window appear in the batch more often than under uniform sampling — they are oversampled.

Why: Long-horizon transitions are often more useful for learning value; oversampling them can improve credit assignment and stability.

Typical range: 20–60. Current: 40 (the oversampled times cover roughly the first 1.5 s of the 7 s window).

oversample_maximum_term_steps: int = 5

Shift and upper bound of the extended range

Used in the same formula: (1) subtract 5 after abs(), and (2) upper bound of the extended range = 140 + 5 = 145. The shift keeps the final time in a sensible range after clip(min=0) and keeps the high end (139) reachable. Without the extra width and shift, the top values could be cut off.

Typical range: 1–10. Current: 5.

min_horizon_to_update_priority_actions: int = 100

Minimum horizon (in actions) for PER priority updates (computed)

Automatically set to temporal_mini_race_duration_actions - 40 (e.g. 140 - 40 = 100). Used only when PER is enabled (prio_alpha > 0).

How it works: After each training step, PER updates the priority of the sampled transitions from the TD-error. The code updates priority only for transitions whose current time in the mini-race (first element of state_float) is less than min_horizon_to_update_priority_actions. So transitions that were interpreted as “near the end” of the mini-race (e.g. time 100–140) do not get their priority updated.

Why: Short-horizon transitions (few steps left in the mini-race) have small TD-targets and noisier TD-errors; using them to update PER priorities can be misleading. Restricting updates to “long-horizon” samples (at least 40 actions left) keeps priorities more meaningful.

Summary: Only transitions with “current time” < 100 (i.e. at least 40 actions remaining in the mini-race) get their PER priority updated. Do not change unless you also change oversample_long_term_steps or the mini-race duration.
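A sketch of the gating (assuming, as described above, that the current mini-race time is the first element of state_float):

import numpy as np

def per_update_mask(state_float_batch, min_horizon_to_update_priority_actions=100):
    current_time_actions = state_float_batch[:, 0]
    # True where the sampled time is below the threshold; only these transitions
    # get their PER priority refreshed from the new TD-error.
    return current_time_actions < min_horizon_to_update_priority_actions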

TensorBoard Logging

tensorboard_suffix_schedule: list = [(0, ""), (6000000, "_2"), (15000000, "_3"), (30000000, "_4"), (45000000, "_5"), (80000000, "_6"), (150000000, "_7")]

TensorBoard log directory suffix schedule

Controls when new TensorBoard log directories are created during long training runs.

How it works:

  • When a schedule point is reached, a new SummaryWriter is created with a new suffix

  • Log directory format: {run_name}{suffix}/ (e.g., uni_4/, uni_4_2/, uni_4_3/)

  • Uses staircase schedule (no interpolation - switches at exact frame counts)

Why split logs:

  • Performance: TensorBoard can slow down significantly with very large log files (>100M data points)

  • Organization: Easier to analyze separate training phases

  • Comparison: Compare different training stages side-by-side

  • File size: Prevents single log files from becoming too large

When to use:

  • Long training runs (>30M frames): Recommended to split logs

  • Short training runs (<30M frames): Can use single log (set to [(0, "")])

  • Very long runs (>100M frames): Essential for TensorBoard performance

Example: For run_name = "uni_4":

  • 0-6M frames: logs to uni_4/

  • 6-15M frames: logs to uni_4_2/

  • 15-30M frames: logs to uni_4_3/

  • And so on…
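A staircase lookup of this kind can be sketched as follows (illustrative; the actual suffix handling lives in the logging code):

def staircase_value(schedule, frames):
    # Take the value of the last schedule point whose frame count has been reached.
    value = schedule[0][1]
    for point_frames, point_value in schedule:
        if frames >= point_frames:
            value = point_value
    return value

# staircase_value([(0, ""), (6_000_000, "_2"), (15_000_000, "_3")], 7_000_000) -> "_2"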

Memory Configuration

Located in the memory section of the config YAML.

Buffer Size

Why the replay buffer is needed

The agent uses off-policy RL (IQN/DQN): it learns from past experience stored in a replay buffer, not only from the current rollout. The buffer is needed for:

  1. Sample efficiency — Each transition (state, action, reward, next state) is expensive: it comes from real-time play. The buffer lets the learner reuse each transition many times (see number_times_single_memory_is_used_before_discard). Without a buffer, we would train only once per frame.

  2. Breaking temporal correlation — Consecutive frames from one run are highly correlated. Training on them in sequence would destabilize learning. Sampling random mini-batches from the buffer decorrelates the data and stabilizes gradient updates.

  3. Stable gradients — Training on diverse, randomly sampled transitions approximates i.i.d. data and helps the Q-network converge.

memory_size_schedule controls how large this buffer is and when learning is allowed to start.

memory_size_schedule: list = [(0, (50000, 20000)), (5000000, (100000, 75000)), (7000000, (200000, 150000))]

Replay buffer size schedule

Format: (frames, (total_size, start_learning_size))

  • total_size: Maximum number of transitions in the buffer. Limits both RAM use and how much past experience is kept. Larger buffers store more diverse data and allow more reuse per transition, but use more memory and take longer to fill.

  • start_learning_size: Minimum number of transitions that must be in the buffer before training starts. Ensures the first updates use reasonably diverse data instead of a few early rollouts; avoids overfitting to the initial exploration.

The schedule is applied by cumulative frames played: each entry is (frames_played, (total_size, start_learning_size)). Sizes grow over training so that:

  • Early (small buffer): The buffer fills quickly, learning starts sooner, and RAM use stays low.

  • Later (larger buffer): More diverse experience is stored for harder phases and finer policy learning.

Typical strategy:

  • Start small (50K total / 20K start) for rapid early training

  • Grow to 100K/75K at 5M frames for more diversity

  • Grow to 200K/150K at 7M frames for maximum diversity

Memory estimate (~10KB per transition):

  • 50K ≈ 500MB

  • 100K ≈ 1GB

  • 200K ≈ 2GB

Prioritized Experience Replay

See: Schaul et al. 2015 - Prioritized Experience Replay

Why Prioritized Experience Replay (PER)?

With uniform replay, all transitions are sampled with equal probability. Many of them are “easy”: the network already predicts them well (low TD-error), and training on them adds little. PER uses a priority proportional to how wrong the network was on that transition (e.g. absolute TD error between predicted and target Q). High-priority transitions are sampled more often, so the same buffer and the same number of batches are used mostly on transitions the agent still needs to learn from.

Benefits: Better sample efficiency — less waste on trivial transitions, more updates on informative ones. Can speed up learning when the distribution of TD-errors is very uneven.

Trade-offs: Prioritized sampling is biased (some transitions are over-represented). Importance sampling (prio_beta) corrects this in the loss; prio_epsilon ensures every transition keeps a non-zero chance of being sampled so no experience is completely ignored. Tuning PER (alpha, beta, epsilon) adds complexity and can cause instability, so it is often left off in favor of uniform replay.

In this project: prio_alpha=0 disables PER; sampling is uniform. The parameters below are available for experiments with prio_alpha > 0. Priorities are updated from the absolute error between predicted and target Q-values after each training step.

prio_alpha: float = 0.0

Priority exponent

Controls how much prioritization affects sampling. Priority is stored as (td_error + eps)^alpha; sampling probability is proportional to that value.

  • alpha=0.0: Uniform sampling (no prioritization). All transitions have equal probability.

  • alpha=1.0: Sampling probability proportional to priority (strong prioritization).

  • 0 < alpha < 1: Smooth interpolation; higher alpha → more focus on high-error transitions.

Paper recommendations:

  • Rainbow-IQN: 0.2

  • Rainbow: 0.5

  • PER: 0.6

Current: 0.0 (uniform) for simplicity and to avoid bias/instability from prioritization.

prio_epsilon: float = 0.002

Priority offset

Added to TD-error before computing priority: (td_error + prio_epsilon)^alpha. Ensures that every transition has a strictly positive priority and thus a non-zero chance of being sampled.

  • Prevents transitions from never being revisited.

  • Current: 2e-3 (keeps sampling reasonably uniform when alpha is small).

  • Typical range: 1e-6 to 1e-2.

prio_beta: float = 1.0

Importance sampling exponent

Prioritized sampling oversamples some transitions; the loss is weighted by inverse sampling probability (raised to beta) so that the expected gradient is unbiased.

  • beta=0.0: No correction (biased updates, prioritization effect is strongest).

  • beta=1.0: Full correction (unbiased). Often annealed from <1 to 1 over training in the PER paper.

Current: 1.0 (full correction; when prio_alpha=0 this has no effect because sampling is uniform).
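The three PER quantities fit together as follows; a NumPy sketch (not the project's sampler, which may normalize differently):

import numpy as np

def per_quantities(td_errors, prio_alpha=0.6, prio_epsilon=0.002, prio_beta=1.0):
    priorities = (np.abs(td_errors) + prio_epsilon) ** prio_alpha
    probs = priorities / priorities.sum()          # sampling probabilities
    n = len(td_errors)
    weights = (n * probs) ** (-prio_beta)          # importance-sampling weights for the loss
    weights = weights / weights.max()              # common normalization: max weight = 1
    return priorities, probs, weights

# With prio_alpha = 0 every priority is 1, probs are uniform and all weights equal 1.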

Memory Usage Control

number_times_single_memory_is_used_before_discard: int = 32

Expected reuse per transition

Controls how many times each transition is expected to be used for training on average. This is a global limit, not a per-transition counter.

How it works:

  • When N new transitions are added to the buffer, the system “allows” N * number_times_single_memory_is_used_before_discard total uses across all training batches

  • Training continues while the global usage counter is below this limit

  • Transitions are removed from the buffer only when it overflows (FIFO), not based on individual usage counts

Interaction with PER:

When prio_alpha > 0 (PER enabled), high-priority transitions are sampled more often. This means:

  • High-priority transitions participate in more batches and contribute more to the global usage counter

  • However, they are not removed faster — removal is FIFO-based when the buffer overflows

  • Priorities are updated after each training step: as the network learns, TD-errors decrease, priorities drop, and sampling frequency self-balances

  • This creates a natural feedback loop: transitions that are hard to learn stay longer (high priority → frequent sampling → priority updates → if still hard, priority stays high)

Potential concern: Could high-priority transitions “consume” the usage budget faster, leaving less training for low-priority ones?

  • In practice, this is mitigated by the self-balancing mechanism: as transitions are learned, their priority decreases

  • prio_epsilon ensures all transitions have a non-zero sampling probability

  • The global limit ensures overall training frequency matches data collection rate

Typical values: 1-64. Higher values = more reuse per transition, better sample efficiency, but transitions may become stale if the policy changes significantly.
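An illustrative sketch of the global reuse budget (hypothetical accounting with invented names, not the project's exact bookkeeping):

def training_allowed(transitions_added_total, batches_done, batch_size,
                     number_times_single_memory_is_used_before_discard=32):
    usage_budget = transitions_added_total * number_times_single_memory_is_used_before_discard
    usage_so_far = batches_done * batch_size
    # The learner may run another batch while total usage stays under the budget.
    return usage_so_far < usage_budget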

Exploration Configuration

Located in the exploration section of the config YAML.

The agent uses hybrid exploration: epsilon-greedy + epsilon-Boltzmann.

Epsilon-Greedy

epsilon_schedule: list = [(0, 1), (50000, 1), (300000, 0.1), (3000000, 0.03)]

Epsilon-greedy exploration schedule

Probability of taking completely random action.

Strategy:

  • Start at 1.0 (100% random) for buffer warmup

  • Maintain until 50K frames

  • Decay to 0.1 by 300K frames

  • Final decay to 0.03 by 3M frames

Interpretation:

  • epsilon=1.0: Pure exploration

  • epsilon=0.1: 90% policy, 10% random

  • epsilon=0.03: 97% policy, 3% random

Epsilon-Boltzmann

epsilon_boltzmann_schedule: list = [(0, 0.15), (3000000, 0.03)]

Epsilon-Boltzmann exploration schedule

Probability of Boltzmann sampling (when not taking random action).

How the action is chosen (one action per step):

  • Random (with probability epsilon): action is chosen uniformly among all actions (ignores Q-values).

  • Boltzmann (with probability (1−epsilon)×epsilon_boltzmann): to each Q(s,a) we add Gaussian noise, then take the action with the maximum of these noised values. So we still pick one action (argmax), but which one can differ from greedy because of the noise.

  • Greedy (with probability (1−epsilon)×(1−epsilon_boltzmann)): action with the maximum Q(s,a) is taken (no noise).

So the difference: greedy = always the best Q; Boltzmann = best Q after adding random noise (often the same action when tau is small, sometimes another).

Combined behavior (epsilon=0.1, epsilon_boltzmann=0.15):

  • 10% purely random

  • 90% × 15% = 13.5% Boltzmann (argmax of Q + noise)

  • 90% × 85% = 76.5% greedy (argmax of Q)

tau_epsilon_boltzmann: float = 0.01

Boltzmann temperature

In the implementation, Gaussian noise is added to Q-values before taking argmax: argmax(Q + tau * randn). So tau controls noise scale:

  • tau → 0: Almost no noise → almost always greedy (max Q)

  • tau large: Large noise → sometimes a suboptimal action wins

Current: 0.01 is very low (near-greedy)

Recommendations:

  • Low (0.01-0.1): Well-calibrated Q-values

  • High (0.5-1.0): Noisy Q-values, need more exploration
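Putting the three branches together, a minimal sketch of per-step action selection (illustrative, not the project's exact sampler):

import numpy as np

def select_action(q_values, epsilon, epsilon_boltzmann, tau_epsilon_boltzmann):
    q_values = np.asarray(q_values, dtype=float)
    if np.random.rand() < epsilon:
        # Purely random action, ignoring Q-values.
        return np.random.randint(len(q_values))
    if np.random.rand() < epsilon_boltzmann:
        # "Boltzmann": argmax over Gaussian-noised Q-values.
        noisy = q_values + tau_epsilon_boltzmann * np.random.randn(len(q_values))
        return int(np.argmax(noisy))
    # Greedy: argmax of the raw Q-values.
    return int(np.argmax(q_values))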

multi_action_exploration: str = "per_action"

How to apply epsilon-greedy when predicting N actions per block (only when n_actions_per_block > 1).

  • per_action (default): Each of the N actions is chosen independently, each with probability epsilon of being random. So the block can mix greedy and random actions per time step.

  • per_block: One draw per block: with probability epsilon the entire block is random (all N actions random), otherwise the entire block is greedy.

Rewards Configuration

Located in the rewards section of the config YAML.

Base Rewards

constant_reward_per_ms: float = -0.0012

Time penalty (per millisecond)

Fixed negative reward at every timestep to encourage speed.

  • Current: -6/5000 = -0.0012 per ms

  • Over 5 seconds: Accumulates -6 total

Balances with progress rewards to prevent reckless driving.

reward_per_m_advanced_along_centerline: float = 0.01

Progress reward (per meter)

Primary positive reward for forward progress along racing line.

  • Current: 5/500 = 0.01 per meter

  • Over 500 meters: Earns +5 total

Balance example (50 m/s speed over 10 seconds):

  • Distance: 500m → Progress: +5

  • Time: 10s → Time penalty: -12

  • Net: Must drive efficiently for positive reward
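The arithmetic of the balance example above, written out with the current coefficients:

constant_reward_per_ms = -0.0012
reward_per_m_advanced_along_centerline = 0.01

speed_mps, seconds = 50, 10
progress_reward = reward_per_m_advanced_along_centerline * speed_mps * seconds  # 0.01 * 500 m  = +5.0
time_penalty = constant_reward_per_ms * seconds * 1000                          # -0.0012 * 10000 ms = -12.0
net_reward = progress_reward + time_penalty                                     # -7.0 over this stretch

Higher average speed raises the progress term relative to the fixed time penalty, so faster driving directly improves the net reward.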

final_speed_reward_as_if_duration_s: float = 0

Bonus reward for gaining speed (scale as “virtual duration in seconds”)

When > 0, each transition gets an extra reward proportional to the increase in speed (norm of velocity): reward += final_speed_reward_per_m_per_s * (speed_now - speed_prev). The coefficient final_speed_reward_per_m_per_s is computed as reward_per_m_advanced_along_centerline * final_speed_reward_as_if_duration_s, so gaining 1 m/s is valued the same as advancing final_speed_reward_as_if_duration_s meters along the centerline.

Purpose: Encourage acceleration and maintaining high speed, not only covering distance. Useful if the agent under-exploits speed on straights.

  • 0 (default): No speed bonus; only progress and time penalty.

  • e.g. 2: Each m/s gained gives the same reward as 2 m of progress. Tune together with reward_per_m_advanced_along_centerline.

Shaped Rewards

Potential-based shaping (distance to current VCP and direction toward next). Uses shaped_reward_* and clamps distance with min/max. Currently disabled (0) by default.

shaped_reward_dist_to_cur_vcp: float = -0.1

Coefficient for potential based on distance to current checkpoint (VCP). Negative encourages getting closer. Used with shaped_reward_min_dist_to_cur_vcp and shaped_reward_max_dist_to_cur_vcp for clamping.

shaped_reward_min_dist_to_cur_vcp: float = 2

Minimum distance (m) to current VCP in the potential; distance is clamped to [min, max] before multiplying by shaped_reward_dist_to_cur_vcp.

shaped_reward_max_dist_to_cur_vcp: float = 25

Maximum distance (m) to current VCP in the potential (see above).

shaped_reward_point_to_vcp_ahead: float = 0

Coefficient for potential term that rewards pointing toward the next VCP (forward component of direction). 0 = disabled.

Engineered rewards (technique bonuses, schedule by frame count). All use schedules [[step, value], ...] and are disabled (0) by default.

engineered_reward_min_dist_to_cur_vcp: float = 5

Lower bound (m) for “close to VCP” engineered reward; distance clamped to [min, max].

engineered_reward_max_dist_to_cur_vcp: float = 25

Upper bound (m) for “close to VCP” engineered reward (see above).

engineered_speedslide_reward_schedule: list = [[0, 0]]

Schedule of reward per step for good speedslide (sliding) on tarmac. Format: [[frame_step, coefficient], ...].

engineered_neoslide_reward_schedule: list = [[0, 0]]

Schedule of reward when lateral speed is high (≥2 m/s), encouraging neoslide-style drifts.

engineered_kamikaze_reward_schedule: list = [[0, 0.0]]

Schedule of reward for “kamikaze” behavior (e.g. gas without steering in air or when wheels not all grounded).

engineered_close_to_vcp_reward_schedule: list = [[0, 0.0]]

Schedule of reward for being close to current VCP (distance clamped by engineered_reward_min_dist_to_cur_vcp / engineered_reward_max_dist_to_cur_vcp).

Tip

Default setup keeps a clean reward: “go fast, go forward”. If learning is too slow, consider enabling shaped_reward_dist_to_cur_vcp for denser checkpoint guidance, or engineered technique rewards after basic driving is learned.

Map Cycle Configuration

Located in the map_cycle section of the config YAML.

map_cycle: list

Map training cycle

Under map_cycle.entries in YAML, each entry has:

  • short_name (str): Logging identifier (e.g., “hock”, “A01”)

  • map_path (str): Path to .Challenge.Gbx file

  • reference_line_path (str): Racing line .npy file in maps/

  • is_exploration (bool): Use exploration strategy

  • fill_buffer (bool): Add experiences to replay buffer

  • repeat (int): How many times to repeat this entry in the cycle

Common patterns (YAML):

map_cycle:
  entries:
    # 4 exploration + 1 evaluation
    - {short_name: A01, map_path: "A01-Race.Challenge.Gbx", reference_line_path: "A01_0.5m_cl.npy", is_exploration: true, fill_buffer: true, repeat: 4}
    - {short_name: A01, map_path: "A01-Race.Challenge.Gbx", reference_line_path: "A01_0.5m_cl.npy", is_exploration: false, fill_buffer: true, repeat: 1}

Performance Configuration

Located in the performance section of the config YAML.

Parallelization

gpu_collectors_count: int = 4

Number of parallel TrackMania instances

More instances = faster data collection.

Tuning recommendations:

  1. Start with 2

  2. Monitor CPU/GPU/RAM usage

  3. Gradually increase until bottleneck

  4. Measure batches trained per minute

  5. Optimal: Maximizes throughput without instability

Typical values:

  • 4-core CPU: 2-4 collectors

  • 8-core CPU: 4-8 collectors

  • 16-core CPU: 8-16 collectors

Memory: ~2GB RAM per instance

max_rollout_queue_size: int = 1

Maximum size of the rollout queue per collector

Each collector process sends finished rollouts to the learner via a multiprocessing Queue. This option sets the maximum number of rollout results that can be buffered in that queue per collector.

Behavior:

  • 1 (default): Collector blocks after each rollout until the learner has taken the result. Tight coupling: learner and collectors stay in sync; no extra memory for queued rollouts.

  • > 1: Collector can push several rollouts into the queue before blocking. If the learner is temporarily slower, collectors can “get ahead” and reduce learner idle time; if collectors are slower, the learner may drain the queue and then wait.

When to increase:

  • Learner is often waiting for rollouts (e.g. “Learner waited Xs for workers”).

  • You have many collectors and want to smooth short bursts of slow rollouts.

Trade-off: Higher values use more memory (one queued item per collector per slot) and can increase latency between policy updates and the data used for training.

Effect on training speed:

  • 1: Learner and collectors are tightly coupled. If the learner is fast (e.g. small batch, few collectors), it may often wait for the next rollout; if collectors are fast, they may block on put() until the learner takes the data. Throughput is limited by whoever is slower at that moment.

  • 2–4: Small buffer smooths short imbalances: a collector can push 2–4 rollouts while the learner is busy, so the learner has work ready when it finishes a batch. Often gives a modest speedup (fewer “Learner waited for workers” pauses) with little extra memory.

  • Large (e.g. 8+): More buffering can help only if the learner is consistently faster than collection (e.g. many collectors, very fast learner). Otherwise the queue rarely fills, and you mainly pay in memory and slightly staler data.

Recommendation: Keep 1 by default. If logs show the learner often waiting several seconds for workers, try 2 or 4 and compare batches per minute; going beyond 4 is rarely needed.

running_speed: int = 160

Game simulation speed multiplier

Run game faster than real-time for rapid data collection.

  • 1: Real-time (for debugging/visualization)

  • 10-50: Fast training with visual observation

  • 100-200: Maximum speed (no visual observation)

Current: 160× real-time

Warning

Too fast may cause physics instability or inaccurate simulation.

Network Synchronization

Training uses one learner process (updates the policy) and several collector processes (run the game and select actions with the current policy). The learner and collectors share the same network weights via a shared network in shared memory. Two parameters control how often weights are pushed from the learner and pulled by the collectors.

Data flow:

  1. Learner trains online_network every batch.

  2. Periodically the learner copies online_network → shared network (in shared memory). This is the push.

  3. Each collector has its own local inference network used to choose actions.

  4. The collector copies shared network → inference network at the start of each rollout and periodically during long rollouts. This is the pull.

send_shared_network_every_n_batches: int = 8

How often the learner pushes new weights to the shared network

Every N training batches, the learner copies the current online_network weights into the shared network (in shared memory). Collectors read from this shared copy when they update their local inference network.

How it works:

  • After each batch, the learner checks cumul_number_batches_done % send_shared_network_every_n_batches == 0

  • When true, it does: shared_network.load_state_dict(online_network.state_dict()) under a lock

  • Collectors never write to the shared network; they only read from it when they pull

Why it matters:

  • Larger (e.g. 16–32): Fewer copies, less lock contention, slightly less up-to-date policy in collectors

  • Smaller (e.g. 2–4): Collectors see new weights more often, but more frequent locking and copy cost

Trade-off: Balance between “collectors use fresh policy” and “learner is not blocked by shared-memory writes”. With 8, collectors are at most 8 batches behind; at ~512 batch size that is a few thousand transitions.

Current: 8 batches. Typical range: 4–16.
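A sketch of the push step (the lock and counter wiring here are illustrative; the project manages them in its own process setup):

def maybe_push_weights(online_network, shared_network, shared_lock,
                       cumul_number_batches_done,
                       send_shared_network_every_n_batches=8):
    if cumul_number_batches_done % send_shared_network_every_n_batches == 0:
        with shared_lock:  # e.g. a multiprocessing lock guarding the shared copy
            shared_network.load_state_dict(online_network.state_dict())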

update_inference_network_every_n_actions: int = 8

How often each collector pulls the shared network into its local inference network during a rollout

Each collector updates its local inference network from the shared network at the start of every rollout. During a long rollout (e.g. a 2-minute race), it also updates every N actions (e.g. every 8 actions). So the policy used for action selection can be refreshed mid-race without waiting for the next rollout.

How it works:

  • At rollout start: collector always calls update_network() (shared → inference) once before driving

  • During the rollout: the game loop runs with _time (race time in milliseconds). Every 10 * run_steps_per_action * update_inference_network_every_n_actions ms, the collector calls update_network() again. That product equals ms_per_action * update_inference_network_every_n_actions (with 50 ms per action), so the interval is N actions in time.

  • With run_steps_per_action = 5, ms_per_action = 50 and update_inference_network_every_n_actions = 8: interval = 50 × 8 = 400 ms = 8 actions

Why it matters:

  • Larger (e.g. 16–32): Fewer copies during a run; inference network may be several hundred actions behind the learner

  • Smaller (e.g. 2–4): Inference network stays closer to the shared (and thus learner) policy during long races; more copy and lock usage

When it helps: Long rollouts (e.g. 1–2 min). If you update only at rollout start, the first half of the race uses weights that may be hundreds of batches old by the time the rollout ends. Updating every N actions keeps inference policy closer to the current learner policy.

Current: 8 actions. With 50 ms per action, that is an update every 400 ms during a rollout. Typical range: 4–16.

Summary: send_shared_network_every_n_batches controls how often the learner updates the shared copy. update_inference_network_every_n_actions controls how often each collector refreshes its local policy from that shared copy during a single run. Both are trade-offs between freshness of the policy used for data collection and the cost of copying/locking.

Visualization and Analysis

make_highest_prio_figures: bool = False

Save images of highest-priority transitions (only when PER is enabled)

When True and the buffer uses Prioritized Experience Replay (prio_alpha > 0), the learner periodically saves PNG images of the transitions that have the highest priority in the replay buffer. These are written to save_dir / "high_prio_figures", not to TensorBoard.

What the figures show:

  • For each of the top 20 transitions by priority (i.e. by TD-error), the code saves a small window of transitions: indices from high_error_idx - 4 to high_error_idx + 5

  • Each saved image is one transition: state frame and next_state frame side by side (horizontally), upscaled 4× for visibility

  • Filename format: {high_error_idx}_{idx}_{n_steps}_{priority:.2f}.png (e.g. 123_120_3_0.45.png)

What you can infer from them:

  • Which situations get high TD-error: High priority ≈ “network was most wrong on this transition”. Looking at the frames tells you where (e.g. sharp turn, specific surface, near wall) and what (state → next_state) the agent is struggling to predict.

  • Clustering: If many high-priority transitions look similar (e.g. same turn, same surface), the agent may need more data or reward shaping there.

  • Debugging PER and rewards: Helps check that high TD-error corresponds to “hard” or “surprising” situations rather than noise or bugs.

When it runs: Only at checkpoint save time (when the learner saves weights and stats), and only if get_config().make_highest_prio_figures is True and the buffer uses PrioritizedSampler. If prio_alpha = 0 (uniform replay), this option has no effect.

TensorBoard: These figures are not sent to TensorBoard. They are only written as PNG files under save_dir / "high_prio_figures". To inspect them in TensorBoard you would need to add custom logging (e.g. writer.add_image(...)) yourself.

Performance: Generating them is relatively slow (iterating over the buffer and saving many images), so they are disabled by default. Enable only for occasional debugging or analysis.

Advanced Topics

Schedules

Most schedules use linear interpolation between points (training.lr_schedule uses exponential interpolation, as noted above):

schedule = [
    (0, 1.0),       # Start at 1.0
    (1000, 0.5),    # Linear decay to 0.5 at 1000 frames
    (2000, 0.1),    # Linear decay to 0.1 at 2000 frames
]

# At frame 500: value = 0.75 (interpolated)
# At frame 1500: value = 0.3 (interpolated)

Changing Configuration

Config is loaded once at process startup (from the YAML path passed to train.py --config). To change parameters you must edit the YAML file and restart training. A snapshot of the config used for each run is saved as config_snapshot.yaml in save/{run_name}/.

Warning

Don’t change mid-run:

  • Network architecture parameters

  • Input dimensions

  • Action space

These require restarting training.

Troubleshooting

Training is slow

  • Increase gpu_collectors_count

  • Increase running_speed

  • Reduce batch_size for more frequent updates

  • Disable visualization options

Agent gets stuck

  • Check cutoff_rollout_if_no_vcp_passed_within_duration_ms

  • Verify map reference line is correct

  • Increase exploration (higher epsilon)

Memory issues

  • Reduce memory_size_schedule values

  • Reduce gpu_collectors_count

  • Reduce batch_size

Further Reading