PPO actor-critic architecture

This page documents the policy and value network used for on-policy optimization: a shared-trunk actor-critic over discrete actions. The same network stacks and the ppo_wiring factory are used when training.algorithm is ppo, dpo, or grpo; this page details the network (Variants A/B/C) and the PPO training loop (GAE, clipped surrogate, value loss). DPO keeps the same bodies but trains from preference pairs; GRPO uses group-relative trajectory returns — see GRPO: network and training.

When training.algorithm is ppo specifically, you get a shared-trunk actor-critic as described below. Implementation lives under trackmania_rl.agents.policy_models and is wired via trackmania_rl.agents.algorithms.ppo_wiring.

For the value-based baseline (quantile IQN + replay), see IQN architecture. BTR (BTR options (IQN + paper extras)) applies to IQN only. Variant A PPO reads nn.vis.cnn (same _build_img_head flags as IQN, without merging btr:). Fusion variants that use the CNN vision branch (infer_vis_branch → cnn) also call _build_img_head with kwargs resolved from nn.vis.cnn — the same single source as IQN and Variant A PPO (trackmania_rl/nn_build/vis_cnn_head.py). TorchMultimodalActorCritic (without policy heads) backs IQN when training.algorithm is iqn and nn.fusion_mode != none.

YAML knobs for PPO routing and vision (nn.fusion_mode, nn.vis, nn.float, nn.encoder): Neural network YAML (nn) — full reference in Configuration Guide. Optional RL parameter freeze (e.g. nn.vis.freeze, nn.encoder.freeze, nn.decoder.shared_trunk_freeze) applies to PPO the same way as to IQN where documented.

Why this stack is shaped this way

Image + float inputs. TrackMania gives both rendered frames and a normalized float vector (geometry, speed, gear, …). Vision learns what the road looks like; the float path carries signals that are tedious to infer from pixels alone. Why both feed one trunk: a single representation is used for action logits and for the value baseline, so both tasks co-adapt the same features.

Shared trunk, two heads (policy + value). The policy head defines a categorical over discrete actions (including multi-offset layouts). The value head predicts expected return from each state. Why actor-critic: the critic feeds GAE and the value loss, which reduces variance of policy gradients compared to pure Monte Carlo returns.

On-policy PPO loop. Data are generated with the current policy, then discarded after a few epochs of updates. Why not replay (here): keeps the off-policy correction simple; PPO’s clipped ratio explicitly limits change w.r.t. the policy that collected the batch, which stabilizes training when rewards are dense and correlated.

Collectors vs learner. Environments run in parallel processes; inference must be fast. Collectors only forward + sample and enqueue lists; the learner does backward + optimizer on aggregated GPU batches. Why store ``ppo_log_probs``: PPO’s ratio compares \(\pi_\theta\) to \(\pi_{\mathrm{old}}\) — the policy that actually produced the actions on the rollout.

GAE (\(\gamma\), \(\lambda\)). Trade off bias vs variance of advantage estimates using the value function and n-step structure. Why: raw one-step TD is noisy; full Monte Carlo is high-variance; GAE interpolates.
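
For reference, the standard estimator this trades off, written with the critic \(V\):

\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}
\]

\(\lambda = 0\) reduces to one-step TD advantages; \(\lambda = 1\) recovers Monte Carlo returns minus the value baseline.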

Clipped surrogate. Penalty if the new policy assigns much more probability to the taken actions than the behavior policy did. Why: approximate trust region — large policy jumps on one minibatch tend to break the on-policy assumption.
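
Written out, with \(\hat{A}_t\) from GAE and \(\varepsilon\) the clip range:

\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)},
\qquad
L^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\big)\right]
\]

Outside the \([1-\varepsilon, 1+\varepsilon]\) band the objective is flat in \(\theta\), so a single minibatch cannot push the policy arbitrarily far from the one that collected it.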

Value loss + entropy bonus. The critic is trained toward return targets; the entropy term discourages premature collapse to deterministic actions. Why: without entropy, exploration from stochastic sampling fades as logits sharpen.

Routing: which network is built?

ppo_wiring.make_network chooses exactly one implementation (first match wins). For an uncompiled policy on CPU (e.g. BC with bc_use_rl_architecture), the same routing is implemented by ppo_wiring.build_ppo_policy_uncompiled (no torch.compile / forced CUDA).

  1. If get_config().transformers.fusion_mode (i.e. nn.fusion_mode) is not none → Variant C: TorchMultimodalActorCritic (native torch.nn.TransformerEncoder stacks; HF vision only inside vision_transformer when nn.vis.transformer.use_hf_backbone).

  2. Else if nn.vis.transformer is set and use_hf_backbone is true → Variant B: HfActorCritic (Hugging Face AutoModel CLS + float MLP + shared trunk).

  3. Else → Variant A: PpoActorCritic (nn.vis.cnn image head via the same _build_img_head kwargs as IQN, or float-only if no_image / no CNN).

Why three variants. A is the default conv + MLP path (fast, full control of CNN flags). B plugs in a pretrained HF vision backbone when you want transfer from large-scale image pretraining. C uses fusion transformers so image and float features interact through attention (and optional hub round-trip), instead of a single early concat — useful when alignment between modalities is subtle.
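
A minimal sketch of that first-match routing (import paths follow the Implementation references section below; the constructor signatures and the helper name are illustrative, not the actual ppo_wiring.make_network body):

# Sketch only: routing order as documented above; cfg field access is assumed.
from trackmania_rl.agents.policy_models.ppo_actor_critic import PpoActorCritic
from trackmania_rl.agents.policy_models.hf_actor_critic import HfActorCritic
from trackmania_rl.agents.policy_models.multimodal_torch_fusion import TorchMultimodalActorCritic

def make_network_sketch(cfg):
    """First match wins."""
    if cfg.nn.fusion_mode != "none":                        # 1. Variant C
        return TorchMultimodalActorCritic(cfg)
    vis_tf = getattr(cfg.nn.vis, "transformer", None)
    if vis_tf is not None and vis_tf.use_hf_backbone:       # 2. Variant B
        return HfActorCritic(cfg)
    return PpoActorCritic(cfg)                              # 3. Variant A (CNN or float-only)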

Note

If fusion_mode: none but the YAML declares only nn.vis.transformer with use_hf_backbone: false (and no cnn), there is no image stem and the result is float-only PPO (a zero image tensor at inference). For CNN PPO, keep nn.vis.cnn.

Training stack (processes and modules)

scripts/train.py starts a learner process and several collector processes. For training.algorithm: ppo, the learner runs learner_ppo; collectors attach PPOInferer and push rollouts into multiprocessing queues. The same policy weights exist as a compiled CUDA module in the collectors and as trainable parameters in the learner; after each PPO update the learner copies its state dict into the shared uncompiled copy under shared_network_lock (collectors refresh their view from that copy — same pattern as IQN’s weight sync).

digraph ppo_process_stack {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   train [label="scripts/train.py", style="rounded,filled", fillcolor=lightcyan];
   lp [label="learner_process.py\nif algorithm == ppo → learner_ppo", style="filled", fillcolor=lightyellow];
   cp [label="collector_process.py × N\nis_policy_optimization_algorithm()", style="filled", fillcolor=lightyellow];
   inf [label="PPOInferer\n(forward + sample + log p, V)"];
   lppo [label="learner_ppo.py\nrollout batch → GAE → PPO loss → Adam"];
   pol [label="policy network\n(make_network)", style="filled", fillcolor=lightgreen];
   sh [label="uncompiled_shared_network\n+ shared_network_lock", style="filled", fillcolor=lightpink];
   q [label="rollout_queues\n(multiprocessing)"];
   train -> lp;
   train -> cp;
   lp -> lppo;
   cp -> inf;
   inf -> pol;
   lppo -> pol;
   inf -> q [label="put"];
   q -> lppo [label="get"];
   lppo -> sh [label="load_state_dict"];
   inf -> sh [style=dashed, label="weights for inference"];
}

Registry: training.algorithm: ppo resolves to trackmania_rl.agents.algorithms.ppo_wiring via registry.get_wiring() (same module also serves DPO/GRPO for network build only).

``uncompiled_shared_network`` and the lock. After each update the learner writes weights into a shared module; collectors read that snapshot for inference. Why: one authoritative weight tensor for many parallel games without training inside env processes.
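
A minimal sketch of that snapshot pattern, assuming plain torch.nn.Module instances shared between processes; the function names are illustrative, the real logic lives in learner_ppo.py and collector_process.py:

import torch

# Learner side: publish the latest weights under the lock after an update.
def publish_weights(trained_net: torch.nn.Module,
                    uncompiled_shared_network: torch.nn.Module,
                    shared_network_lock) -> None:
    with shared_network_lock:
        uncompiled_shared_network.load_state_dict(trained_net.state_dict())

# Collector side: refresh the inference copy from the shared snapshot.
def refresh_weights(inference_net: torch.nn.Module,
                    uncompiled_shared_network: torch.nn.Module,
                    shared_network_lock) -> None:
    with shared_network_lock:
        inference_net.load_state_dict(uncompiled_shared_network.state_dict())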

Overview

Like IQN, the model consumes two branches:

  • Image (B, 1, H, W) — grayscale frame (or a zero tensor if the image head is disabled);

  • Float (B, float_input_dim) — the same normalized state vector as IQN (waypoints, gear, velocity, etc.).

Outputs:

  • Policy: logits for a categorical distribution over actions. Single-decision mode: (B, n_actions). Multi-action mode (rl_action_offsets_ms with more than one offset): (B, n_actions_per_block * n_actions) reshaped to (B, N, n_actions) inside evaluate_actions.

  • Value: scalar V(s) per sample, (B, 1) before squeeze.

Float normalization \((x-\mu)/\sigma\) (running buffers) matches IQN so BC / IQN / PPO can share statistics. Why: stable MLP inputs when raw speeds and distances have different scales.
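
A minimal sketch of that normalization as non-trainable buffers (buffer names as in Variant A below; the zero-mean / unit-std initialization is an assumption):

import torch
import torch.nn as nn

class FloatNorm(nn.Module):
    """(x - mean) / std with running buffers, as used in the float branch."""
    def __init__(self, float_input_dim: int):
        super().__init__()
        self.register_buffer("float_inputs_mean", torch.zeros(float_input_dim))
        self.register_buffer("float_inputs_std", torch.ones(float_input_dim))

    def forward(self, float_inputs: torch.Tensor) -> torch.Tensor:
        return (float_inputs - self.float_inputs_mean) / self.float_inputs_std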

digraph ppo_overview {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   img [label="img\n(B,1,H,W)", style="filled", fillcolor=lightblue];
   flt [label="float_inputs\n(B,F)", style="filled", fillcolor=lightblue];
   cnn [label="Image head\n(optional CNN)"];
   mlp [label="Float MLP\n2×Linear+ReLU"];
   nrm [label="(x−μ)/σ\nper feature"];
   cat [label="Concat\n(B, D_vis + D_float)"];
   trk [label="Trunk\n2×Linear+ReLU"];
   pi [label="policy_head\nLinear → logits", style="filled", fillcolor=lightyellow];
   v [label="value_head\nLinear → V(s)", style="filled", fillcolor=lightyellow];

   img -> cnn;
   flt -> nrm -> mlp;
   cnn -> cat;
   mlp -> cat;
   cat -> trk -> pi;
   trk -> v;
}

Variant A: CNN actor-critic

Class: PpoActorCritic in ppo_actor_critic.py. Example YAML: config_ppo_cnn_mlp.yaml (minimal) or config_ppo.yaml with nn.fusion_mode: none and nn.vis.cnn.

Image branch

If nn.vis.no_image is false and nn.vis.cnn is present, the stem calls the same _build_img_head as IQN (trackmania_rl/agents/iqn.py) with flags taken directly from nn.vis.cnn: use_impala_cnn, impala_model_size, use_spectral_norm, use_adaptive_maxpool, adaptive_maxpool_size. The conv output is flattened to conv_head_output_dim.

Unlike IQN, this path does not read btr: — only nn.vis.cnn. (BTR is an IQN-only bundle.)

If no_image is true or there is no CNN stem, img_head is omitted and the trunk input is float-only.

Float branch

  1. Normalize with buffers float_inputs_mean / float_inputs_std.

  2. Two linear layers with ReLU: float_input_dim → float_hidden_dim → float_hidden_dim.

Width float_hidden_dim comes from get_config().float_hidden_dim (i.e. nn.float.mlp.hidden_dim); the encoder.mlp override applies to fusion PPO only, not this variant.

Fusion and trunk

  • With image: h = concat(CNN(img), float_MLP(float)).

  • Trunk: Linear → ReLU → Linear → ReLU with width dense_hidden_dimension.

Heads

  • policy_head: dense_hidden_dimension → n_actions * n_actions_per_block.

  • value_head: dense_hidden_dimension → 1.

At inference and training, evaluate_actions computes log-probability and entropy from the categorical defined by logits (product of N categoricals in multi-action mode).

digraph ppo_evaluate {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   fwd [label="forward(img, float)\n→ logits, value"];
   rs [label="reshape logits\n(B,N,A) if multi-action"];
   cat [label="Categorical(logits)\nper head / factor"];
   out [label="log π(a|s), H[π], V(s)", style="filled", fillcolor=lightgreen];
   fwd -> rs -> cat -> out;
}
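
A minimal sketch of what the diagram computes, assuming flat logits of shape (B, N·A) and integer actions of shape (B, N); the real evaluate_actions lives in ppo_actor_critic.py:

import torch
from torch.distributions import Categorical

def evaluate_actions_sketch(logits: torch.Tensor, actions: torch.Tensor,
                            n_heads: int, n_actions: int):
    """logits: (B, n_heads * n_actions); actions: (B, n_heads) integer indices."""
    logits = logits.view(-1, n_heads, n_actions)      # (B, N, A); N == 1 in single-decision mode
    dist = Categorical(logits=logits)
    log_prob = dist.log_prob(actions).sum(dim=-1)     # product of N categoricals -> sum of log-probs
    entropy = dist.entropy().sum(dim=-1)              # joint entropy = sum of per-head entropies
    return log_prob, entropy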

Variant B: Hugging Face vision backbone

Enabled when nn.fusion_mode is none and nn.vis.transformer.use_hf_backbone is true (requires pip install -e ".[policy]"). Class: HfActorCritic in hf_actor_critic.py.

Factory: make_hf_ppo_network_pair in ppo_wiring.make_network.

digraph ppo_hf {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   img [label="img\n(B,1,H,W)", style="filled", fillcolor=lightblue];
   flt [label="float_inputs\n(B,F)", style="filled", fillcolor=lightblue];
   prep [label="resize / RGB /\nprocessor norm"];
   vit [label="HF AutoModel\n(CLS token)"];
   fmlp [label="Float MLP +\nLinear → hidden"];
   cat [label="Concat\n(B, 2·H)"];
   trk [label="Trunk + heads\n(same idea as CNN PPO)"];
   img -> prep -> vit -> cat;
   flt -> fmlp -> cat;
   cat -> trk;
}

Pixels are interpolated to the processor’s height/width, duplicated to 3 channels if needed, mapped from [-1,1] to [0,1], then normalized with the processor’s mean/std when available. Float features use the same two-layer MLP as the CNN variant, then a linear projection to the backbone hidden_size so that image CLS and float embeddings are concatenated before the shared trunk.
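
A minimal sketch of that preprocessing, assuming img in [-1, 1] with shape (B, 1, H, W); exact handling of the processor attributes in hf_actor_critic.py may differ:

import torch
import torch.nn.functional as F

def preprocess_for_hf(img: torch.Tensor, size_hw, image_mean=None, image_std=None):
    """img: (B, 1, H, W) in [-1, 1]  ->  (B, 3, H', W') ready for the HF backbone."""
    x = F.interpolate(img, size=size_hw, mode="bilinear", align_corners=False)
    x = x.repeat(1, 3, 1, 1)                  # grayscale -> 3 channels
    x = (x + 1.0) / 2.0                       # [-1, 1] -> [0, 1]
    if image_mean is not None and image_std is not None:
        mean = torch.tensor(image_mean, device=x.device).view(1, -1, 1, 1)
        std = torch.tensor(image_std, device=x.device).view(1, -1, 1, 1)
        x = (x - mean) / std                  # processor normalization when available
    return x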

Variant C: Multimodal fusion (nn.fusion_mode != none)

When nn.fusion_mode is one of vision_transformer, post_concat, or unified, ppo_wiring.make_network builds TorchMultimodalActorCritic (multimodal_torch_fusion.py). The multimodal bundle exposed as get_config().transformers combines nn.fusion_mode, nn.init_from_pretrained, and nn.encoder.transformer (plus nn.vis.transformer for the image side).

Common after fusion: Linear → ReLU → Linear → ReLU trunk of width nn.decoder.dense_hidden_dimension (same config field name as IQN; PPO reads it as dense_hidden_dim), then policy / value linear heads.

Float width: float_hidden_dim_effective() = nn.encoder.mlp.hidden_dim if set, else nn.float.mlp.hidden_dim.

Sub-modes

vision_transformer

Image → either (a) native PatchEmbed2d + nn.TransformerEncoder on patch tokens + mean-pool, using nn.vis.transformer (d_model, n_layers, n_heads, ff_mult, dropout, patch_size), or (b) HF backbone (CLS) + optional vis_refine encoder when use_hf_backbone: true (requires transformers). Float → two-layer MLP. Fusion: concat(image_emb, float_emb) → bridge Linear to dense_hidden_dim → trunk.
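
A minimal sketch of option (a), the native patch path; hyperparameter names mirror nn.vis.transformer, while positional embeddings and the real PatchEmbed2d are simplified away:

import torch
import torch.nn as nn

class PatchVisionSketch(nn.Module):
    """Patch tokens -> nn.TransformerEncoder -> mean-pooled image embedding."""
    def __init__(self, d_model=128, n_layers=2, n_heads=4, ff_mult=4,
                 patch_size=8, dropout=0.0):
        super().__init__()
        # A conv with stride == kernel == patch_size acts as a simple patch embedding.
        self.patch = nn.Conv2d(1, d_model, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_model * ff_mult,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 1, H, W) with patch_size dividing H and W
        tokens = self.patch(img).flatten(2).transpose(1, 2)   # (B, n_patches, d_model)
        return self.encoder(tokens).mean(dim=1)               # (B, d_model)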

post_concat

Image → if use_image_head and the vision branch is CNN, _build_img_head with flags from nn.vis.cnn (IMPALA / adaptive pool / spectral norm as configured). Native patch or HF vision use nn.vis.transformer instead. Float → fused_vector layout: two-layer MLP (width float_hidden_dim_effective()), then concat with the vision vector and projection to a token sequence (length nn.encoder.transformer.post_concat_seq_len). token_sequence layout (e.g. float_token_layout: per_feature) uses raw float tokens without that MLP. Then learned positions, fusion nn.TransformerEncoder from nn.encoder.transformer (when not linear), pool → bridge → trunk.

Hub round-trip: fusion save_pretrained / from_pretrained JSON may include rulka_transformers.vis_cnn (dump of nn.vis.cnn) so CNN stems match after reload; older hubs without vis_cnn fall back to the baseline 4-conv kwargs for the CNN branch.

unified

Single joint encoder over image token(s) and learned float token(s) (unified_float_tokens). There is no separate float MLP in this mode (raw float features go through float_to_tokens). Native patch vision: vis.transformer.d_model must equal encoder.transformer.d_model (schema). CNN vision contributes one token; HF vision contributes N patch tokens (N inferred from the HF backbone when the model is built). Fusion trunk: same fusion_encoder options as the other multimodal modes (default native_transformer when the encoder is not HF; otherwise hf_embedding per infer_fusion_encoder).

digraph ppo_fusion_modes {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   m1 [label="vision_transformer:\nimage (patch or HF CLS) + float MLP\n→ concat → bridge"];
   m2 [label="post_concat:\nCNN ∥ float MLP → tokenize\n→ Enc_fusion → pool → bridge"];
   m3 [label="unified:\npatch tokens ∥ float tokens\n→ Enc_fusion → pool → bridge"];
   tr [label="Trunk + policy / value heads", style="filled", fillcolor=lightgreen];
   m1 -> tr;
   m2 -> tr;
   m3 -> tr;
}

Patch geometry: nn.vis.transformer.patch_size must divide H_downsized and W_downsized for native vision_transformer and unified. post_concat ignores patch size on the image side (CNN stem).
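
A quick pre-flight check of that constraint (a sketch; H_downsized / W_downsized come from the image config):

def check_patch_geometry(h_downsized: int, w_downsized: int, patch_size: int) -> None:
    # Required for native vision_transformer and unified; post_concat ignores patch size.
    if h_downsized % patch_size or w_downsized % patch_size:
        raise ValueError(
            f"nn.vis.transformer.patch_size={patch_size} must divide "
            f"H_downsized={h_downsized} and W_downsized={w_downsized}"
        )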

Optional warm start: nn.init_from_pretrained (Rulka fusion save_pretrained directory) after build; trust flags follow nn.encoder.transformer.trust_remote_code.

Example YAML for post_concat + HF two-tower fusion: config_ppo_transformer.yaml. For native vision_transformer (patch + Linear fuse, no HF), start from config_ppo.yaml and set nn.fusion_mode: vision_transformer with nn.vis.transformer.use_hf_backbone: false. See Neural network YAML (nn) — full reference in Configuration Guide.

IQN vs PPO (same inputs, different heads)

Aspect       | IQN (IQN architecture)                                    | PPO (this page)
Output       | Distributional Q(s,a,τ) via quantile embedding + dueling | π(a|s) logits + V(s)
Training     | Replay buffer, n-step, quantile Huber, target network    | On-policy rollouts, GAE, clipped surrogate, no replay
Exploration  | ε-greedy / Boltzmann / NoisyNet (config)                  | Stochastic policy sample from Categorical

Training flow (high level) — training.algorithm: ppo only

The following loop runs in trackmania_rl.multiprocess.learner_ppo when the algorithm is PPO. It does not apply to DPO or GRPO (those learners reuse collectors and often the same rollout tensor builder, but substitute their own losses).

digraph ppo_train {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   col [label="Collectors:\nPPOInferer\nnetwork(img, float)"];
   q [label="Rollout queues:\nlog p, V, states, actions"];
   rew [label="ppo_rewards:\nvectorized rewards +\npotential shaping (γΦ'−Φ)"];
   gae [label="learner_ppo:\nGAE → Â, R"];
   loss [label="PPO loss:\nclip + c_v·L_V − c_e·H"];
   col -> q -> rew -> gae -> loss;
}
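
The shaping term named in the rewards node is standard potential-based shaping: each step reward is augmented as

\[
r'_t = r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)
\]

which densifies the signal without changing which policies are optimal.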

Per-step training objective (schematic): the learner minimizes a sum of clipped policy surrogate, value error (often with clipping vs old values), and negative entropy (i.e. entropy bonus). ppo_loss_components in trackmania_rl/agents/policy_optimization/ppo.py implements the algebra; exact coefficients and schedules come from ppo: in YAML (PPO configuration (ppo:) in Configuration Guide).

digraph ppo_loss_schematic {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   inp [label="batch:\nlog π_old, V_old, a, r, done\n+ forward: log π_θ, V_θ", style="filled", fillcolor=lightblue];
   rat [label="ratio r = exp(log π_θ − log π_old)"];
   clip [label="L_clip = min(r·Â, clip(r)·Â)"];
   lv [label="L_V: MSE or clipped value vs returns R"];
   ent [label="entropy H[π_θ]\n−c_e · mean(H)"];
   sum [label="loss = −L_clip + c_v·L_V − c_e·H\n(minimize)", style="filled", fillcolor=lightgreen];
   inp -> rat -> clip -> sum;
   inp -> lv -> sum;
   inp -> ent -> sum;
}

Reading the loss diagram:

  • Clipped policy branch: if the ratio \(r\) moves outside \([1-\varepsilon, 1+\varepsilon]\), the objective flattens — so the update does not over-reward actions that the new policy already exploits much more than the old one.

  • Value branch: pulls \(V_\theta\) toward returns (optionally clipped to old values) — so the critic tracks what will happen from each state, which in turn makes GAE’s advantages more accurate.

  • Entropy branch: subtracts mean entropy from the loss (i.e. maximize entropy) — so sampling stays diverse early in training.
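
A minimal sketch of how the three branches combine into the scalar loss; the coefficients c_v / c_e and the clip range follow the schematic above, while the real ppo_loss_components may add value clipping and other details:

import torch
import torch.nn.functional as F

def ppo_loss_sketch(log_prob_new, log_prob_old, advantages, values_new, returns,
                    entropy, clip_eps=0.2, c_v=0.5, c_e=0.01):
    """loss = -L_clip + c_v * L_V - c_e * H, to be minimized."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()     # -L_clip
    value_loss = F.mse_loss(values_new, returns)             # L_V (unclipped variant)
    return policy_loss + c_v * value_loss - c_e * entropy.mean()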

Advantage flow: compute_gae consumes step rewards, V(s_t) at collection time, dones, and a bootstrap value at the rollout tail; it outputs per-step advantages \(\hat{A}_t\) and returns \(R_t\) for the value target.

digraph ppo_gae_flow {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   r [label="rewards r_t", style="filled", fillcolor=lightblue];
   v [label="values V(s_t)", style="filled", fillcolor=lightblue];
   d [label="dones", style="filled", fillcolor=lightblue];
   gae [label="compute_gae\n(γ, λ)"];
   out [label="Â_t , R_t", style="filled", fillcolor=lightgreen];
   r -> gae;
   v -> gae;
   d -> gae;
   gae -> out;
}
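
A minimal reference implementation of that flow, assuming 1-D tensors ordered in time and a scalar bootstrap value for the step after the rollout tail; the real compute_gae in policy_optimization/ppo.py may differ in details:

import torch

def compute_gae_sketch(rewards, values, dones, bootstrap_value, gamma=0.99, lam=0.95):
    """rewards / values / dones: (T,) tensors for one rollout; returns (advantages, value targets)."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    next_value = bootstrap_value
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values          # R_t used as value targets
    return advantages, returns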

Rollout payload (per episode chunk): collectors enqueue Python lists that the learner turns into GPU tensors. DPO/GRPO reuse the same keys for the shared builder (policy_rollout_batch).

digraph ppo_rollout_payload {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   col [label="collector\nPPOInferer step", style="filled", fillcolor=lightyellow];
   f [label="frames[]"];
   sf [label="state_float[]"];
   a [label="actions[]"];
   lp [label="ppo_log_probs[]"];
   pv [label="ppo_values[]"];
   q [label="rollout_queue\n+ end_race_stats", style="filled", fillcolor=lightpink];
   col -> f -> q;
   col -> sf -> q;
   col -> a -> q;
   col -> lp -> q;
   col -> pv -> q;
}

Why every queue field matters: frames / state_float reconstruct \(s_t\); actions are the labels for \(\log\pi(a|s)\); ppo_log_probs are \(\log\pi_{\mathrm{old}}\) at collection time; ppo_values bootstrap GAE and the value loss. end_race_stats drives logging and some reward edge cases (e.g. finish flags).
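
A minimal sketch of one enqueued chunk and its learner-side conversion; the key names follow the diagram, while the dict-of-lists shape and the conversion helper are illustrative (the real builder is policy_rollout_batch):

import numpy as np
import torch

# Collector side: one rollout chunk, lists appended step by step.
rollout = {
    "frames": [],          # s_t image, one array per step
    "state_float": [],     # s_t float vector
    "actions": [],         # a_t (one index, or per-head indices in multi-action mode)
    "ppo_log_probs": [],   # log pi_old(a_t | s_t) at collection time
    "ppo_values": [],      # V(s_t) at collection time, for GAE and the value loss
}

# Learner side: lists -> GPU tensors (dtypes / devices per the real builder).
def to_gpu_batch(chunk, device="cuda"):
    return {k: torch.as_tensor(np.asarray(v), device=device) for k, v in chunk.items()}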

  1. Collectors run the compiled (or eager) actor-critic on CUDA; append ppo_log_probs, ppo_values, frames, state_float, actions.

  2. Learner aggregates rollouts until ppo.rollout_steps_per_update, builds tensors on GPU, computes rewards aligned with IQN’s dense + engineered terms (reward_vectorized + fold; see rollout_rewards.py).

  3. GAE uses scheduled γ / λ (and optional ppo_*_schedule in config).

  4. Optimizer updates the same network used in collectors; weights are copied to the shared inference copy under a lock.

Key design notes

  • Float inputs are always used when float_input_dim > 0; they are not auxiliary metadata. With image off, the policy is float-only.

  • Shared trunk means gradients from policy and value both affect CNN/float representations (unless you freeze modules elsewhere).

  • Schedules: learning rate and optional PPO coefficients can follow piecewise schedules on the global frame counter (see configuration guide).
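
Purely as an illustration of a piecewise schedule keyed on the global frame counter (the actual schedule format and keys are defined in the configuration guide):

import numpy as np

def piecewise_value(frame: int, schedule: list[tuple[int, float]]) -> float:
    """Linear interpolation between (frame, value) breakpoints, clamped at both ends."""
    frames, values = zip(*schedule)
    return float(np.interp(frame, frames, values))

# e.g. anneal a coefficient from 1e-2 to 1e-3 over the first 5M frames:
# piecewise_value(2_500_000, [(0, 1e-2), (5_000_000, 1e-3)])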

Implementation references

  • trackmania_rl/agents/policy_models/ppo_actor_critic.py — CNN PPO network.

  • trackmania_rl/agents/policy_models/multimodal_torch_fusion.py — native Transformer multimodal fusion (nn.fusion_mode / get_config().transformers); IQN reuses it without policy heads.

  • trackmania_rl/nn_build/vis_cnn_head.py — kwargs for _build_img_head from nn.vis.cnn (IQN, PPO CNN, multimodal CNN branch, BC, pretrain Level 0 when rl_config_path is used).

  • trackmania_rl/agents/policy_models/hf_actor_critic.py — HF backbone PPO.

  • trackmania_rl/agents/algorithms/ppo_wiring.py — factory, PPOInferer, compile warmup hook.

  • trackmania_rl/agents/policy_optimization/ppo.py — GAE, clipped loss.

  • trackmania_rl/agents/policy_optimization/rollout_rewards.py — full TM rewards for PPO.

  • trackmania_rl/reward_vectorized.py — shared dense reward + potentials.

  • trackmania_rl/multiprocess/learner_ppo.py — PPO learner loop.

  • trackmania_rl/multiprocess/collector_process.py — attaches PPOInferer for any policy-optimization algorithm (ppo, dpo, grpo).

See also