PPO actor-critic architecture
This page documents the policy and value network used for on-policy policy
optimization with a discrete shared-trunk actor-critic. The same stacks
and ppo_wiring factory are used when training.algorithm is ppo,
dpo, or grpo; this page details the network (Variants A/B/C) and
the PPO training loop (GAE, clipped surrogate, value loss). DPO keeps the
same bodies but trains from preference pairs; GRPO uses group-relative
trajectory returns — see GRPO: network and training.
When training.algorithm is ppo specifically, you get a shared-trunk
actor-critic as described below. Implementation lives under
trackmania_rl.agents.policy_models and is wired via
trackmania_rl.agents.algorithms.ppo_wiring.
For the value-based baseline (quantile IQN + replay), see IQN architecture.
BTR (BTR options (IQN + paper extras)) applies to IQN only. Variant A PPO reads
nn.vis.cnn (same _build_img_head flags as IQN, without merging btr:).
Fusion variants that use the CNN vision branch (infer_vis_branch → cnn)
also call _build_img_head with kwargs resolved from nn.vis.cnn — the same
single source as IQN and Variant A PPO (trackmania_rl/nn_build/vis_cnn_head.py).
TorchMultimodalActorCritic (without policy heads) backs IQN when training.algorithm is iqn and nn.fusion_mode != none.
YAML knobs for PPO routing and vision (nn.fusion_mode, nn.vis, nn.float, nn.encoder): Neural network YAML (nn) — full reference in Configuration Guide. Optional RL parameter freeze (e.g. nn.vis.freeze, nn.encoder.freeze, nn.decoder.shared_trunk_freeze) applies to PPO the same way as to IQN where documented.
Why this stack is shaped this way
Image + float inputs. TrackMania gives both rendered frames and a normalized float vector (geometry, speed, gear, …). Vision learns what the road looks like; the float path carries signals that are tedious to infer from pixels alone. Why both feed one trunk: a single representation is used for action logits and for the value baseline, so both tasks co-adapt the same features.
Shared trunk, two heads (policy + value). The policy head defines a categorical over discrete actions (including multi-offset layouts). The value head predicts expected return from each state. Why actor-critic: the critic feeds GAE and the value loss, which reduces variance of policy gradients compared to pure Monte Carlo returns.
On-policy PPO loop. Data are generated with the current policy, then discarded after a few epochs of updates. Why not replay (here): keeps the off-policy correction simple; PPO’s clipped ratio explicitly limits change w.r.t. the policy that collected the batch, which stabilizes training when rewards are dense and correlated.
Collectors vs learner. Environments run in parallel processes; inference must be fast. Collectors only forward + sample and enqueue lists; the learner does backward + optimizer on aggregated GPU batches. Why store ppo_log_probs: PPO’s ratio compares \(\pi_\theta\) to \(\pi_{\mathrm{old}}\) — the policy that actually produced the actions on the rollout.
GAE (\(\gamma\), \(\lambda\)). Trade off bias vs variance of advantage estimates using the value function and n-step structure. Why: raw one-step TD is noisy; full Monte Carlo is high-variance; GAE interpolates.
Clipped surrogate. Penalty if the new policy assigns much more probability to the taken actions than the behavior policy did. Why: approximate trust region — large policy jumps on one minibatch tend to break the on-policy assumption.
Value loss + entropy bonus. The critic is trained toward return targets; the entropy term discourages premature collapse to deterministic actions. Why: without entropy, exploration from stochastic sampling fades as logits sharpen.
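In symbols (the standard GAE and clipped-surrogate definitions, written in the notation used elsewhere on this page):

\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l} \]

\[ L^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)} \]

Setting \(\lambda = 0\) recovers one-step TD advantages; \(\lambda = 1\) recovers Monte Carlo returns minus the value baseline.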
Routing: which network is built?
ppo_wiring.make_network chooses exactly one implementation (first match wins).
For an uncompiled policy on CPU (e.g. BC with bc_use_rl_architecture), the same
routing is implemented by ppo_wiring.build_ppo_policy_uncompiled (no torch.compile / forced CUDA).
If get_config().transformers.fusion_mode (i.e. nn.fusion_mode) is not none → Variant C — TorchMultimodalActorCritic (native torch.nn.TransformerEncoder stacks; HF vision only inside vision_transformer when nn.vis.transformer.use_hf_backbone).
Else if nn.vis.transformer is set and use_hf_backbone is true → Variant B — HfActorCritic (Hugging Face AutoModel CLS + float MLP + shared trunk).
Else → Variant A — PpoActorCritic (nn.vis.cnn image head via the same _build_img_head kwargs as IQN, or float-only if no_image / no CNN).
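As pseudocode, the same first-match-wins order (a simplified sketch; cfg stands in for get_config(), and the returned strings are labels rather than the factory's real return values):

```python
def pick_variant(cfg):
    # First match wins, in the same order as the list above.
    if cfg.transformers.fusion_mode != "none":            # nn.fusion_mode
        return "C: TorchMultimodalActorCritic"
    vis_transformer = getattr(cfg.nn.vis, "transformer", None)
    if vis_transformer is not None and vis_transformer.use_hf_backbone:
        return "B: HfActorCritic"
    return "A: PpoActorCritic"                            # CNN image head, or float-only
```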
Why three variants. A is the default conv + MLP path (fast, full control of CNN flags). B plugs in a pretrained HF vision backbone when you want transfer from large-scale image pretraining. C uses fusion transformers so image and float features interact through attention (and optional hub round-trip), instead of a single early concat — useful when alignment between modalities is subtle.
Note
If fusion_mode: none but YAML declares only nn.vis.transformer with use_hf_backbone: false (no cnn), the CNN branch gets no image stem → float-only PPO (the image tensor is zeroed at inference). For CNN PPO, keep nn.vis.cnn.
Training stack (processes and modules)
scripts/train.py starts a learner process and several collector processes.
For training.algorithm: ppo, the learner runs learner_ppo; collectors attach
PPOInferer and push rollouts into multiprocessing queues. The same policy
weights exist as a compiled CUDA module in collectors and as trainable parameters
in the learner; after each PPO update the learner copies state dict into the
shared uncompiled copy under shared_network_lock (collectors refresh their
view from that copy — same pattern as IQN’s weight sync).
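A minimal sketch of that sync pattern (assuming a module allocated in shared memory and a multiprocessing lock, per the description above; the function names here are illustrative, the real call sites live in learner_ppo and the collector loop):

```python
# Learner side: publish updated weights after an optimizer step.
def publish_weights(trained_net, uncompiled_shared_network, shared_network_lock):
    with shared_network_lock:
        uncompiled_shared_network.load_state_dict(trained_net.state_dict())

# Collector side: refresh the local inference copy before the next rollout.
def refresh_view(inference_net, uncompiled_shared_network, shared_network_lock):
    with shared_network_lock:
        inference_net.load_state_dict(uncompiled_shared_network.state_dict())
```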
![digraph ppo_process_stack {
rankdir=TB;
node [shape=box, fontname="Helvetica", fontsize=10];
train [label="scripts/train.py", style="rounded,filled", fillcolor=lightcyan];
lp [label="learner_process.py\nif algorithm == ppo → learner_ppo", style="filled", fillcolor=lightyellow];
cp [label="collector_process.py × N\nis_policy_optimization_algorithm()", style="filled", fillcolor=lightyellow];
inf [label="PPOInferer\n(forward + sample + log p, V)"];
lppo [label="learner_ppo.py\nrollout batch → GAE → PPO loss → Adam"];
pol [label="policy network\n(make_network)", style="filled", fillcolor=lightgreen];
sh [label="uncompiled_shared_network\n+ shared_network_lock", style="filled", fillcolor=lightpink];
q [label="rollout_queues\n(multiprocessing)"];
train -> lp;
train -> cp;
lp -> lppo;
cp -> inf;
inf -> pol;
lppo -> pol;
inf -> q [label="put"];
q -> lppo [label="get"];
lppo -> sh [label="load_state_dict"];
inf -> sh [style=dashed, label="weights for inference"];
}](../_images/graphviz-4de800a1330c9ccded8d228bb541e43c1aafe989.png)
Registry: training.algorithm: ppo resolves to trackmania_rl.agents.algorithms.ppo_wiring via registry.get_wiring() (same module also serves DPO/GRPO for network build only).
uncompiled_shared_network and the lock. After each update the learner writes weights into a shared module; collectors read that snapshot for inference. Why: one authoritative weight tensor for many parallel games without training inside env processes.
Overview
Like IQN, the model consumes two branches:
Image — (B, 1, H, W), a grayscale frame (or a zero tensor if the image head is disabled);
Float — (B, float_input_dim), the same normalized state vector as IQN (waypoints, gear, velocity, etc.).
Outputs:
Policy: logits for a categorical distribution over actions. Single-decision mode: (B, n_actions). Multi-action mode (rl_action_offsets_ms with more than one offset): (B, n_actions_per_block * n_actions), reshaped to (B, N, n_actions) inside evaluate_actions.
Value: scalar V(s) per sample, (B, 1) before squeeze.
Float normalization \((x-\mu)/\sigma\) (running buffers) matches IQN so BC / IQN / PPO can share statistics. Why: stable MLP inputs when raw speeds and distances have different scales.
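As a minimal sketch, the normalization layer looks like this (buffer names follow float_inputs_mean / float_inputs_std used by Variant A below; the class name is illustrative):

```python
import torch
import torch.nn as nn

class FloatNorm(nn.Module):
    """(x - mu) / sigma with running statistics stored as buffers,
    so they travel in state_dict without being trained."""
    def __init__(self, float_input_dim: int):
        super().__init__()
        self.register_buffer("float_inputs_mean", torch.zeros(float_input_dim))
        self.register_buffer("float_inputs_std", torch.ones(float_input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.float_inputs_mean) / self.float_inputs_std
```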
![digraph ppo_overview {
rankdir=LR;
node [shape=box, fontname="Helvetica", fontsize=10];
img [label="img\n(B,1,H,W)", style="filled", fillcolor=lightblue];
flt [label="float_inputs\n(B,F)", style="filled", fillcolor=lightblue];
cnn [label="Image head\n(optional CNN)"];
mlp [label="Float MLP\n2×Linear+ReLU"];
nrm [label="(x−μ)/σ\nper feature"];
cat [label="Concat\n(B, D_vis + D_float)"];
trk [label="Trunk\n2×Linear+ReLU"];
pi [label="policy_head\nLinear → logits", style="filled", fillcolor=lightyellow];
v [label="value_head\nLinear → V(s)", style="filled", fillcolor=lightyellow];
img -> cnn;
flt -> nrm -> mlp;
cnn -> cat;
mlp -> cat;
cat -> trk -> pi;
trk -> v;
}](../_images/graphviz-f40820934518727c01f991f10e225032c1309c4a.png)
Variant A: CNN actor-critic
Class: PpoActorCritic in ppo_actor_critic.py. Example YAML: config_ppo_cnn_mlp.yaml (minimal) or config_ppo.yaml with nn.fusion_mode: none and nn.vis.cnn.
Image branch
If nn.vis.no_image is false and nn.vis.cnn is present, the stem calls the same _build_img_head as IQN (trackmania_rl/agents/iqn.py) with flags taken directly from nn.vis.cnn: use_impala_cnn, impala_model_size, use_spectral_norm, use_adaptive_maxpool, adaptive_maxpool_size. The conv output is flattened to conv_head_output_dim.
Unlike IQN, this path does not read btr: — only nn.vis.cnn. (BTR is an IQN-only bundle.)
If no_image is true or there is no CNN stem, img_head is omitted and the trunk input is float-only.
Float branch
Normalize with buffers float_inputs_mean / float_inputs_std.
Two linear layers with ReLU: float_input_dim → float_hidden_dim → float_hidden_dim.
Width float_hidden_dim comes from get_config().float_hidden_dim → nn.float.mlp.hidden_dim (encoder.mlp override applies to fusion PPO only, not this variant).
Fusion and trunk
With image: h = concat(CNN(img), float_MLP(float)).
Trunk: Linear → ReLU → Linear → ReLU with width dense_hidden_dimension.
Heads
policy_head: dense_hidden_dimension → n_actions * n_actions_per_block.
value_head: dense_hidden_dimension → 1.
At inference and training, evaluate_actions computes log-probability and
entropy from the categorical defined by logits (product of N categoricals
in multi-action mode).
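A sketch of the computation (shapes follow the Outputs section above; this illustrates the math rather than quoting evaluate_actions verbatim):

```python
import torch
from torch.distributions import Categorical

def evaluate_actions_sketch(logits, actions, n_blocks, n_actions):
    """logits: (B, n_blocks * n_actions); actions: (B,) or (B, n_blocks) long."""
    if n_blocks > 1:
        # Multi-action mode: product of N independent categoricals.
        logits = logits.view(-1, n_blocks, n_actions)    # (B, N, A)
        dist = Categorical(logits=logits)
        log_prob = dist.log_prob(actions).sum(dim=-1)    # sum log p over the N factors
        entropy = dist.entropy().sum(dim=-1)
    else:
        dist = Categorical(logits=logits)
        log_prob = dist.log_prob(actions)
        entropy = dist.entropy()
    return log_prob, entropy
```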
![digraph ppo_evaluate {
rankdir=TB;
node [shape=box, fontname="Helvetica", fontsize=10];
fwd [label="forward(img, float)\n→ logits, value"];
rs [label="reshape logits\n(B,N,A) if multi-action"];
cat [label="Categorical(logits)\nper head / factor"];
out [label="log π(a|s), H[π], V(s)", style="filled", fillcolor=lightgreen];
fwd -> rs -> cat -> out;
}](../_images/graphviz-83af5993b4e54846d9f2ce0b6b29059cfadb74f2.png)
Variant B: Hugging Face vision backbone
Enabled when nn.fusion_mode is none and nn.vis.transformer.use_hf_backbone is true
(requires pip install -e ".[policy]"). Class: HfActorCritic in
hf_actor_critic.py.
Factory: make_hf_ppo_network_pair, called from ppo_wiring.make_network.
![digraph ppo_hf {
rankdir=LR;
node [shape=box, fontname="Helvetica", fontsize=10];
img [label="img\n(B,1,H,W)", style="filled", fillcolor=lightblue];
flt [label="float_inputs\n(B,F)", style="filled", fillcolor=lightblue];
prep [label="resize / RGB /\nprocessor norm"];
vit [label="HF AutoModel\n(CLS token)"];
fmlp [label="Float MLP +\nLinear → hidden"];
cat [label="Concat\n(B, 2·H)"];
trk [label="Trunk + heads\n(same idea as CNN PPO)"];
img -> prep -> vit -> cat;
flt -> fmlp -> cat;
cat -> trk;
}](../_images/graphviz-b93157c944c5f07066dcc7cd62dc6bda55e08042.png)
Pixels are interpolated to the processor’s height/width, duplicated to 3
channels if needed, mapped from [-1,1] to [0,1], then normalized with
the processor’s mean/std when available. Float features use the same two-layer
MLP as the CNN variant, then a linear projection to the backbone hidden_size
so that image CLS and float embeddings are concatenated before the shared
trunk.
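A sketch of that preprocessing chain under stated assumptions (a frame already in [-1, 1]; mean/std taken from the HF processor when available; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def preprocess_for_hf(img, target_h, target_w, mean=None, std=None):
    """img: (B, 1, H, W) in [-1, 1] → (B, 3, target_h, target_w) for the backbone."""
    x = F.interpolate(img, size=(target_h, target_w),
                      mode="bilinear", align_corners=False)
    x = x.repeat(1, 3, 1, 1)          # duplicate grayscale to 3 channels
    x = (x + 1.0) / 2.0               # [-1, 1] → [0, 1]
    if mean is not None and std is not None:
        mean = torch.as_tensor(mean, device=x.device).view(1, -1, 1, 1)
        std = torch.as_tensor(std, device=x.device).view(1, -1, 1, 1)
        x = (x - mean) / std          # processor normalization when available
    return x
```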
Variant C: Multimodal fusion (nn.fusion_mode ≠ none)
When nn.fusion_mode is one of vision_transformer, post_concat, or unified, ppo_wiring.make_network builds TorchMultimodalActorCritic (multimodal_torch_fusion.py). The multimodal bundle exposed as get_config().transformers combines nn.fusion_mode, nn.init_from_pretrained, and nn.encoder.transformer (plus nn.vis.transformer for the image side).
Common after fusion: Linear → ReLU → Linear → ReLU trunk of width nn.decoder.dense_hidden_dimension (same config field name as IQN; PPO reads it as dense_hidden_dim), then policy / value linear heads.
Float width: float_hidden_dim_effective() = nn.encoder.mlp.hidden_dim if set, else nn.float.mlp.hidden_dim.
Sub-modes
vision_transformer — Image → either (a) native PatchEmbed2d + nn.TransformerEncoder on patch tokens + mean-pool, using nn.vis.transformer (d_model, n_layers, n_heads, ff_mult, dropout, patch_size), or (b) HF backbone (CLS) + optional vis_refine encoder when use_hf_backbone: true (requires transformers). Float → two-layer MLP. Fusion: concat(image_emb, float_emb) → bridge Linear to dense_hidden_dim → trunk.
post_concat — Image → if use_image_head and the vision branch is CNN, _build_img_head with flags from nn.vis.cnn (IMPALA / adaptive pool / spectral norm as configured); native patch or HF vision use nn.vis.transformer instead. Float → fused_vector layout: two-layer MLP (width float_hidden_dim_effective()), then concat with the vision vector and projection to a token sequence (length nn.encoder.transformer.post_concat_seq_len); token_sequence layout (e.g. float_token_layout: per_feature) uses raw float tokens without that MLP. Then learned positions, fusion nn.TransformerEncoder from nn.encoder.transformer (when not linear), pool → bridge → trunk. Hub round-trip: the fusion save_pretrained / from_pretrained JSON may include rulka_transformers.vis_cnn (a dump of nn.vis.cnn) so CNN stems match after reload; older hubs without vis_cnn fall back to the baseline 4-conv kwargs for the CNN branch.
unified — Single joint encoder over image token(s) and learned float token(s) (unified_float_tokens). No separate float MLP in this mode (raw float features go through float_to_tokens). Native patch vision: vis.transformer.d_model must equal encoder.transformer.d_model (schema-enforced). CNN vision contributes one token; HF vision contributes N patch tokens (N inferred from the HF backbone when the model is built). Fusion trunk: same fusion_encoder options as the other multimodal modes (default native_transformer when the encoder is not HF; else hf_embedding per infer_fusion_encoder).
![digraph ppo_fusion_modes {
rankdir=TB;
node [shape=box, fontname="Helvetica", fontsize=10];
m1 [label="vision_transformer:\nimage (patch or HF CLS) + float MLP\n→ concat → bridge"];
m2 [label="post_concat:\nCNN ∥ float MLP → tokenize\n→ Enc_fusion → pool → bridge"];
m3 [label="unified:\npatch tokens ∥ float tokens\n→ Enc_fusion → pool → bridge"];
tr [label="Trunk + policy / value heads", style="filled", fillcolor=lightgreen];
m1 -> tr;
m2 -> tr;
m3 -> tr;
}](../_images/graphviz-6a70d4d148e4a7b6f103f488afbc66254ec81c21.png)
Patch geometry: nn.vis.transformer.patch_size must divide H_downsized and W_downsized for native vision_transformer and unified. post_concat ignores patch size on the image side (CNN stem).
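The divisibility rule exists because a native patch stem is a strided convolution that must tile the frame exactly; a minimal illustrative sketch (class name hypothetical, not the file's exact PatchEmbed2d):

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Split (B, 1, H, W) into (H/p)·(W/p) tokens of width d_model."""
    def __init__(self, d_model: int, patch_size: int):
        super().__init__()
        self.proj = nn.Conv2d(1, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        B, _, H, W = img.shape
        p = self.proj.kernel_size[0]
        assert H % p == 0 and W % p == 0, "patch_size must divide H and W"
        tokens = self.proj(img)                    # (B, d_model, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, n_tokens, d_model)
```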
Optional warm start: nn.init_from_pretrained (Rulka fusion save_pretrained directory) after build; trust flags follow nn.encoder.transformer.trust_remote_code.
Example YAML for post_concat + HF two-tower fusion: config_ppo_transformer.yaml. For native vision_transformer (patch + Linear fuse, no HF), start from config_ppo.yaml and set nn.fusion_mode: vision_transformer with nn.vis.transformer.use_hf_backbone: false. See Neural network YAML (nn) — full reference in Configuration Guide.
IQN vs PPO (same inputs, different heads)
| Aspect | IQN (IQN architecture) | PPO (this page) |
|---|---|---|
| Output | Distributional Q(s,a,τ) via quantile embedding + dueling | π(a|s) logits + V(s) |
| Training | Replay buffer, n-step, quantile Huber, target network | On-policy rollouts, GAE, clipped surrogate, no replay |
| Exploration | ε-greedy / Boltzmann / NoisyNet (config) | Stochastic policy (sampled from the Categorical) |
Training flow (high level) — training.algorithm: ppo only
The following loop runs in trackmania_rl.multiprocess.learner_ppo when the
algorithm is PPO. It does not apply to DPO or GRPO (those learners reuse
collectors and often the same rollout tensor builder, but substitute their own
losses).
![digraph ppo_train {
rankdir=TB;
node [shape=box, fontname="Helvetica", fontsize=10];
col [label="Collectors:\nPPOInferer\nnetwork(img, float)"];
q [label="Rollout queues:\nlog p, V, states, actions"];
rew [label="ppo_rewards:\nvectorized rewards +\npotential shaping (γΦ'−Φ)"];
gae [label="learner_ppo:\nGAE → Â, R"];
loss [label="PPO loss:\nclip + c_v·L_V − c_e·H"];
col -> q -> rew -> gae -> loss;
}](../_images/graphviz-ba7e48824ce202b7a47ec08e16b932383ba498d7.png)
Per-step training objective (schematic): the learner minimizes a sum of clipped
policy surrogate, value error (often with clipping vs old values), and
negative entropy (i.e. entropy bonus). ppo_loss_components in
trackmania_rl/agents/policy_optimization/ppo.py implements the algebra; exact
coefficients and schedules come from ppo: in YAML (PPO configuration (ppo:) in
Configuration Guide).
![digraph ppo_loss_schematic {
rankdir=TB;
node [shape=box, fontname="Helvetica", fontsize=10];
inp [label="batch:\nlog π_old, V_old, a, r, done\n+ forward: log π_θ, V_θ", style="filled", fillcolor=lightblue];
rat [label="ratio r = exp(log π_θ − log π_old)"];
clip [label="L_clip = min(r·Â, clip(r)·Â)"];
lv [label="L_V: MSE or clipped value vs returns R"];
ent [label="entropy H[π_θ]\n−c_e · mean(H)"];
sum [label="loss = −L_clip + c_v·L_V − c_e·H\n(minimize)", style="filled", fillcolor=lightgreen];
inp -> rat -> clip -> sum;
inp -> lv -> sum;
inp -> ent -> sum;
}](../_images/graphviz-b77016b09783eb9f0fb37368f88bdf3072e3a8f6.png)
Reading the loss diagram:
Clipped policy branch: if the ratio \(r\) moves outside \([1-\varepsilon, 1+\varepsilon]\), the objective flattens — so the update does not over-reward actions that the new policy already exploits much more than the old one.
Value branch: pulls \(V_\theta\) toward returns (optionally clipped to old values) — so the critic tracks what will happen from each state, which in turn makes GAE’s advantages more accurate.
Entropy branch: subtracts mean entropy from the loss (i.e. maximize entropy) — so sampling stays diverse early in training.
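Putting the three branches together, a minimal sketch of the combined objective (coefficients c_v / c_e and the value-clipping option mirror the diagram; the project's exact algebra lives in ppo_loss_components, and the default values here are illustrative):

```python
import torch

def ppo_loss_sketch(log_pi, log_pi_old, values, values_old, returns, adv,
                    entropy, clip_eps=0.2, c_v=0.5, c_e=0.01, clip_value=True):
    ratio = torch.exp(log_pi - log_pi_old)
    # Clipped policy surrogate (to be maximized, hence the minus sign below).
    l_clip = torch.min(ratio * adv,
                       torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
    # Value loss, optionally clipped against the collection-time values.
    if clip_value:
        v_clipped = values_old + torch.clamp(values - values_old, -clip_eps, clip_eps)
        l_v = torch.max((values - returns) ** 2, (v_clipped - returns) ** 2).mean()
    else:
        l_v = ((values - returns) ** 2).mean()
    return -l_clip + c_v * l_v - c_e * entropy.mean()
```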
Advantage flow: compute_gae consumes step rewards, V(s_t) at collection
time, dones, and a bootstrap value at the rollout tail; it outputs per-step
advantages \(\hat{A}_t\) and returns \(R_t\) for the value target.
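For reference, a standard GAE recursion with the same inputs and outputs (a sketch; compute_gae may differ in vectorization and scheduling details):

```python
import torch

def compute_gae_sketch(rewards, values, dones, bootstrap_value, gamma, lam):
    """rewards / values / dones: (T,) tensors for one rollout; returns (Â_t, R_t)."""
    T = rewards.shape[0]
    adv = torch.zeros_like(rewards)
    gae = 0.0
    next_value = bootstrap_value                   # V at the rollout tail
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        adv[t] = gae
        next_value = values[t]
    returns = adv + values                         # value targets R_t
    return adv, returns
```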
![digraph ppo_gae_flow {
rankdir=LR;
node [shape=box, fontname="Helvetica", fontsize=10];
r [label="rewards r_t", style="filled", fillcolor=lightblue];
v [label="values V(s_t)", style="filled", fillcolor=lightblue];
d [label="dones", style="filled", fillcolor=lightblue];
gae [label="compute_gae\n(γ, λ)"];
out [label="Â_t , R_t", style="filled", fillcolor=lightgreen];
r -> gae;
v -> gae;
d -> gae;
gae -> out;
}](../_images/graphviz-2426e55e35c31468d321c657661df03de1f5d8db.png)
Rollout payload (per episode chunk): collectors enqueue Python lists the learner
turns into GPU tensors. DPO/GRPO reuse the same keys for the shared builder
(policy_rollout_batch).
![digraph ppo_rollout_payload {
rankdir=LR;
node [shape=box, fontname="Helvetica", fontsize=10];
col [label="collector\nPPOInferer step", style="filled", fillcolor=lightyellow];
f [label="frames[]"];
sf [label="state_float[]"];
a [label="actions[]"];
lp [label="ppo_log_probs[]"];
pv [label="ppo_values[]"];
q [label="rollout_queue\n+ end_race_stats", style="filled", fillcolor=lightpink];
col -> f -> q;
col -> sf -> q;
col -> a -> q;
col -> lp -> q;
col -> pv -> q;
}](../_images/graphviz-69fdd0db10ba940a0692c2174d2b2e6860258639.png)
Why every queue field matters: frames / state_float reconstruct
\(s_t\); actions are the labels for \(\log\pi(a|s)\);
ppo_log_probs are \(\log\pi_{\mathrm{old}}\) at collection time;
ppo_values bootstrap GAE and the value loss. end_race_stats drives logging
and some reward edge cases (e.g. finish flags).
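Schematically, one enqueued chunk might look like this (an illustration of the keys above, not the literal payload type built by PPOInferer / policy_rollout_batch):

```python
# Plain Python lists are cheap to pickle across processes;
# the learner stacks them into GPU tensors.
rollout_chunk = {
    "frames": [],          # per-step (1, H, W) image arrays (s_t pixels)
    "state_float": [],     # per-step (F,) float vectors (s_t floats)
    "actions": [],         # int action indices, labels for log π(a|s)
    "ppo_log_probs": [],   # log π_old(a_t|s_t) at collection time
    "ppo_values": [],      # V(s_t) at collection time, bootstraps GAE
    "end_race_stats": {},  # logging + reward edge cases (finish flags)
}
```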
Collectors run the compiled (or eager) actor-critic on CUDA; append ppo_log_probs, ppo_values, frames, state_float, actions.
Learner aggregates rollouts until ppo.rollout_steps_per_update, builds tensors on GPU, computes rewards aligned with IQN’s dense + engineered terms (reward_vectorized + fold; see rollout_rewards.py).
GAE uses scheduled γ/λ (and optional ppo_*_schedule in config).
Optimizer updates the same network used in collectors; weights are copied to the shared inference copy under a lock.
Key design notes
Float inputs are always used when float_input_dim > 0; they are not auxiliary metadata. With the image branch off, the policy is float-only.
Shared trunk means gradients from policy and value both affect CNN/float representations (unless you freeze modules elsewhere).
Schedules: learning rate and optional PPO coefficients can follow piecewise schedules on the global frame counter (see configuration guide).
Implementation references
trackmania_rl/agents/policy_models/ppo_actor_critic.py — CNN PPO network.
trackmania_rl/agents/policy_models/multimodal_torch_fusion.py — native Transformer multimodal fusion (nn.fusion_mode / get_config().transformers); IQN reuses it without policy heads.
trackmania_rl/nn_build/vis_cnn_head.py — kwargs for _build_img_head from nn.vis.cnn (IQN, PPO CNN, multimodal CNN branch, BC, pretrain Level 0 when rl_config_path is used).
trackmania_rl/agents/policy_models/hf_actor_critic.py — HF backbone PPO.
trackmania_rl/agents/algorithms/ppo_wiring.py — factory, PPOInferer, compile warmup hook.
trackmania_rl/agents/policy_optimization/ppo.py — GAE, clipped loss.
trackmania_rl/agents/policy_optimization/rollout_rewards.py — full TM rewards for PPO.
trackmania_rl/reward_vectorized.py — shared dense reward + potentials.
trackmania_rl/multiprocess/learner_ppo.py — PPO learner loop.
trackmania_rl/multiprocess/collector_process.py — attaches PPOInferer for any policy-optimization algorithm (ppo, dpo, grpo).
See also
GRPO: network and training — same policy network; group-relative trajectory training.
DPO (preference learning, same network): DPO configuration (dpo:) in Configuration Guide.
NN topology catalog (supported stacks) — full matrix of supported nn topologies.
IQN architecture — baseline value-based architecture.
BTR options (IQN + paper extras) — IQN-only extras (not applied to PPO CNN factory).
Configuration Guide — ppo:, training:, nn:.