.. _ppo_architecture:

PPO actor-critic architecture
=============================

This page documents the policy and value network used for **on-policy policy optimization** with a **discrete** shared-trunk actor-critic. The **same** stacks and ``ppo_wiring`` factory are used when ``training.algorithm`` is ``ppo``, ``dpo``, or ``grpo``; **this page** details the **network** (Variants A/B/C) and the **PPO** training loop (GAE, clipped surrogate, value loss). **DPO** keeps the same bodies but trains from preference pairs; **GRPO** uses group-relative trajectory returns — see :doc:`grpo_architecture`. When ``training.algorithm`` is ``ppo`` specifically, you get the **shared-trunk actor-critic** described below.

Implementation lives under ``trackmania_rl.agents.policy_models`` and is wired via ``trackmania_rl.agents.algorithms.ppo_wiring``. For the value-based baseline (quantile IQN + replay), see :doc:`iqn_architecture`. **BTR** (:doc:`btr_architecture`) applies to **IQN only**. **Variant A** PPO reads ``nn.vis.cnn`` (same ``_build_img_head`` flags as IQN, without merging ``btr:``). **Fusion** variants that use the **CNN** vision branch (``infer_vis_branch`` → ``cnn``) also call ``_build_img_head`` with kwargs resolved from ``nn.vis.cnn`` — the same single source as IQN and Variant A PPO (``trackmania_rl/nn_build/vis_cnn_head.py``). ``TorchMultimodalActorCritic`` (without policy heads) backs **IQN** when ``training.algorithm`` is ``iqn`` and ``nn.fusion_mode != none``.

YAML knobs for PPO routing and vision (``nn.fusion_mode``, ``nn.vis``, ``nn.float``, ``nn.encoder``): :ref:`nn-yaml-reference` in :doc:`../configuration_guide`. Optional :ref:`nn-rl-parameter-freeze` (e.g. ``nn.vis.freeze``, ``nn.encoder.freeze``, ``nn.decoder.shared_trunk_freeze``) applies to PPO the same way as to IQN where documented.

Why this stack is shaped this way
---------------------------------

**Image + float inputs.** TrackMania gives both rendered frames and a normalized float vector (geometry, speed, gear, …). Vision learns what the road *looks* like; the float path carries signals that are tedious to infer from pixels alone. **Why both feed one trunk:** a single representation is used for action logits and for the value baseline, so both tasks co-adapt the same features.

**Shared trunk, two heads (policy + value).** The **policy head** defines a categorical over discrete actions (including multi-offset layouts). The **value head** predicts expected return from each state. **Why actor-critic:** the critic feeds GAE and the value loss, which reduces the variance of policy gradients compared to pure Monte Carlo returns.

**On-policy PPO loop.** Data are generated with the *current* policy, then discarded after a few epochs of updates. **Why not replay (here):** keeps the off-policy correction simple; PPO's clipped ratio explicitly limits change w.r.t. the policy that collected the batch, which stabilizes training when rewards are dense and correlated.

**Collectors vs learner.** Environments run in parallel processes; inference must be fast. Collectors only **forward + sample** and enqueue lists; the learner does **backward + optimizer** on aggregated GPU batches. **Why store ``ppo_log_probs``:** PPO's ratio compares :math:`\pi_\theta` to :math:`\pi_{\mathrm{old}}` — the policy that actually produced the actions on the rollout.

**GAE (:math:`\gamma`, :math:`\lambda`).** Trades off bias against variance of advantage estimates using the value function and n-step structure. **Why:** raw one-step TD is noisy; full Monte Carlo is high-variance; GAE interpolates between them.
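For reference, the standard GAE definitions behind that trade-off (the project-specific handling of dones and bootstrapping lives in ``compute_gae``, covered later on this page):

.. math::

   \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}, \qquad R_t = \hat{A}_t + V(s_t)

Setting :math:`\lambda = 0` recovers one-step TD; :math:`\lambda = 1` recovers Monte Carlo returns.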
**Clipped surrogate.** Penalty if the new policy assigns much more probability to the taken actions than the behavior policy did. **Why:** approximate trust region — large policy jumps on one minibatch tend to break the on-policy assumption.

**Value loss + entropy bonus.** The critic is trained toward return targets; the entropy term discourages premature collapse to deterministic actions. **Why:** without entropy, exploration from stochastic sampling fades as logits sharpen.

Routing: which network is built?
--------------------------------

``ppo_wiring.make_network`` chooses **exactly one** implementation (first match wins). For an **uncompiled** policy on CPU (e.g. BC with ``bc_use_rl_architecture``), the same routing is implemented by ``ppo_wiring.build_ppo_policy_uncompiled`` (no ``torch.compile`` / forced CUDA).

1. If ``get_config().transformers.fusion_mode`` (i.e. ``nn.fusion_mode``) is **not** ``none`` → **Variant C** — ``TorchMultimodalActorCritic`` (native ``torch.nn.TransformerEncoder`` stacks; HF vision only inside ``vision_transformer`` when ``nn.vis.transformer.use_hf_backbone``).
2. Else if ``nn.vis.transformer`` is set **and** ``use_hf_backbone`` is true → **Variant B** — ``HfActorCritic`` (Hugging Face ``AutoModel`` CLS + float MLP + shared trunk).
3. Else → **Variant A** — ``PpoActorCritic`` (``nn.vis.cnn`` image head via the same ``_build_img_head`` kwargs as IQN, or float-only if ``no_image`` / no CNN).

**Why three variants.** **A** is the default conv + MLP path (fast, full control of CNN flags). **B** plugs in a **pretrained HF** vision backbone when you want transfer from large-scale image pretraining. **C** uses **fusion transformers** so image and float features interact through attention (and optional hub round-trip), instead of a single early concat — useful when alignment between modalities is subtle.

.. note::

   If ``fusion_mode: none`` but YAML declares **only** ``nn.vis.transformer`` with ``use_hf_backbone: false`` (no ``cnn``), the CNN branch sees no image stem → **float-only** PPO (zeros image tensor at inference). For CNN PPO, keep ``nn.vis.cnn``.
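Condensed, the routing above looks roughly like this (schematic only; the attribute paths mirror the YAML keys rather than the exact config object used by ``ppo_wiring.make_network``):

.. code-block:: python

   def pick_ppo_variant(cfg) -> str:
       """First match wins; mirrors the numbered list above (illustrative)."""
       if cfg.transformers.fusion_mode != "none":                 # nn.fusion_mode
           return "TorchMultimodalActorCritic"                    # Variant C
       vis_transformer = getattr(cfg.vis, "transformer", None)    # nn.vis.transformer
       if vis_transformer is not None and vis_transformer.use_hf_backbone:
           return "HfActorCritic"                                 # Variant B
       return "PpoActorCritic"                                    # Variant A (CNN or float-only)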
Training stack (processes and modules)
--------------------------------------

``scripts/train.py`` starts a **learner** process and several **collector** processes. For ``training.algorithm: ppo``, the learner runs ``learner_ppo``; collectors attach ``PPOInferer`` and push rollouts into multiprocessing queues. The **same** policy weights exist as a compiled CUDA module in collectors and as trainable parameters in the learner; after each PPO update the learner copies its state dict into the shared ``uncompiled`` copy under ``shared_network_lock`` (collectors refresh their view from that copy — same pattern as IQN's weight sync).

.. graphviz::

   digraph ppo_process_stack {
     rankdir=TB;
     node [shape=box, fontname="Helvetica", fontsize=10];
     train [label="scripts/train.py", style="rounded,filled", fillcolor=lightcyan];
     lp [label="learner_process.py\nif algorithm == ppo → learner_ppo", style="filled", fillcolor=lightyellow];
     cp [label="collector_process.py × N\nis_policy_optimization_algorithm()", style="filled", fillcolor=lightyellow];
     inf [label="PPOInferer\n(forward + sample + log p, V)"];
     lppo [label="learner_ppo.py\nrollout batch → GAE → PPO loss → Adam"];
     pol [label="policy network\n(make_network)", style="filled", fillcolor=lightgreen];
     sh [label="uncompiled_shared_network\n+ shared_network_lock", style="filled", fillcolor=lightpink];
     q [label="rollout_queues\n(multiprocessing)"];
     train -> lp;
     train -> cp;
     lp -> lppo;
     cp -> inf;
     inf -> pol;
     lppo -> pol;
     inf -> q [label="put"];
     q -> lppo [label="get"];
     lppo -> sh [label="load_state_dict"];
     inf -> sh [style=dashed, label="weights for inference"];
   }

**Registry:** ``training.algorithm: ppo`` resolves to ``trackmania_rl.agents.algorithms.ppo_wiring`` via ``registry.get_wiring()`` (the same module also serves DPO/GRPO, for **network** build only).

**The shared network and the lock.** After each update the learner writes weights into ``uncompiled_shared_network``; collectors read that snapshot for inference. **Why:** one authoritative set of weights for many parallel games without training inside env processes.
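A minimal sketch of that hand-off (the helper name is hypothetical; the real learner performs the copy at the end of each update):

.. code-block:: python

   import torch

   def publish_weights(trained: torch.nn.Module,
                       uncompiled_shared_network: torch.nn.Module,
                       shared_network_lock) -> None:
       """One writer (the learner), many readers (the collectors)."""
       with shared_network_lock:
           uncompiled_shared_network.load_state_dict(trained.state_dict())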
Overview
--------

Like IQN, the model consumes two branches:

- **Image** ``(B, 1, H, W)`` — grayscale frame (or a zero tensor if the image head is disabled);
- **Float** ``(B, float_input_dim)`` — the same normalized state vector as IQN (waypoints, gear, velocity, etc.).

Outputs:

- **Policy:** logits for a categorical distribution over actions. Single-decision mode: ``(B, n_actions)``. Multi-action mode (``rl_action_offsets_ms`` with more than one offset): ``(B, n_actions_per_block * n_actions)`` reshaped to ``(B, N, n_actions)`` inside ``evaluate_actions``.
- **Value:** scalar ``V(s)`` per sample, ``(B, 1)`` before squeeze.

**Float normalization** :math:`(x-\mu)/\sigma` (running buffers) matches IQN so BC / IQN / PPO can share statistics. **Why:** stable MLP inputs when raw speeds and distances have different scales.

.. graphviz::

   digraph ppo_overview {
     rankdir=LR;
     node [shape=box, fontname="Helvetica", fontsize=10];
     img [label="img\n(B,1,H,W)", style="filled", fillcolor=lightblue];
     flt [label="float_inputs\n(B,F)", style="filled", fillcolor=lightblue];
     cnn [label="Image head\n(optional CNN)"];
     mlp [label="Float MLP\n2×Linear+ReLU"];
     nrm [label="(x−μ)/σ\nper feature"];
     cat [label="Concat\n(B, D_vis + D_float)"];
     trk [label="Trunk\n2×Linear+ReLU"];
     pi [label="policy_head\nLinear → logits", style="filled", fillcolor=lightyellow];
     v [label="value_head\nLinear → V(s)", style="filled", fillcolor=lightyellow];
     img -> cnn;
     flt -> nrm -> mlp;
     cnn -> cat;
     mlp -> cat;
     cat -> trk -> pi;
     trk -> v;
   }

Variant A: CNN actor-critic
---------------------------

Class: ``PpoActorCritic`` in ``ppo_actor_critic.py``. Example YAML: ``config_ppo_cnn_mlp.yaml`` (minimal) or ``config_ppo.yaml`` with ``nn.fusion_mode: none`` and ``nn.vis.cnn``.

Image branch
~~~~~~~~~~~~

If ``nn.vis.no_image`` is false and ``nn.vis.cnn`` is present, the stem calls the **same** ``_build_img_head`` as IQN (``trackmania_rl/agents/iqn.py``) with flags taken **directly** from ``nn.vis.cnn``: ``use_impala_cnn``, ``impala_model_size``, ``use_spectral_norm``, ``use_adaptive_maxpool``, ``adaptive_maxpool_size``. The conv output is flattened to ``conv_head_output_dim``. Unlike IQN, this path does **not** read ``btr:`` — only ``nn.vis.cnn``. (BTR is an IQN-only bundle.) If ``no_image`` is true or there is no CNN stem, ``img_head`` is omitted and the trunk input is **float-only**.

Float branch
~~~~~~~~~~~~

1. Normalize with buffers ``float_inputs_mean`` / ``float_inputs_std``.
2. Two linear layers with ReLU: ``float_input_dim → float_hidden_dim → float_hidden_dim``.

Width ``float_hidden_dim`` comes from ``get_config().float_hidden_dim`` → ``nn.float.mlp.hidden_dim`` (``encoder.mlp`` override applies to **fusion** PPO only, not this variant).

Fusion and trunk
~~~~~~~~~~~~~~~~

- With image: ``h = concat(CNN(img), float_MLP(float))``.
- Trunk: ``Linear → ReLU → Linear → ReLU`` with width ``dense_hidden_dimension``.

Heads
~~~~~

- ``policy_head``: ``dense_hidden_dimension → n_actions * n_actions_per_block``.
- ``value_head``: ``dense_hidden_dimension → 1``.

At inference and training, ``evaluate_actions`` computes **log-probability** and **entropy** from the categorical defined by logits (product of ``N`` categoricals in multi-action mode).

.. graphviz::

   digraph ppo_evaluate {
     rankdir=TB;
     node [shape=box, fontname="Helvetica", fontsize=10];
     fwd [label="forward(img, float)\n→ logits, value"];
     rs [label="reshape logits\n(B,N,A) if multi-action"];
     cat [label="Categorical(logits)\nper head / factor"];
     out [label="log π(a|s), H[π], V(s)", style="filled", fillcolor=lightgreen];
     fwd -> rs -> cat -> out;
   }
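A stand-alone sketch of the log-probability / entropy bookkeeping behind that diagram (names are illustrative and the value output is omitted; the project's ``evaluate_actions`` is the reference):

.. code-block:: python

   import torch
   from torch.distributions import Categorical

   def evaluate_actions_sketch(logits: torch.Tensor, actions: torch.Tensor,
                               n_blocks: int, n_actions: int):
       """logits: (B, n_blocks * n_actions) from the policy head;
       actions: (B, n_blocks) integer indices. n_blocks == 1 in single-decision mode."""
       dist = Categorical(logits=logits.view(-1, n_blocks, n_actions))  # (B, N, A)
       log_prob = dist.log_prob(actions).sum(dim=-1)   # product of factors → sum of logs
       entropy = dist.entropy().sum(dim=-1)
       return log_prob, entropy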
Variant B: Hugging Face vision backbone
---------------------------------------

Enabled when ``nn.fusion_mode`` is ``none`` and ``nn.vis.transformer.use_hf_backbone`` is ``true`` (requires ``pip install -e ".[policy]"``). Class: ``HfActorCritic`` in ``hf_actor_critic.py``. Factory: ``make_hf_ppo_network_pair`` in ``ppo_wiring.make_network``.

.. graphviz::

   digraph ppo_hf {
     rankdir=LR;
     node [shape=box, fontname="Helvetica", fontsize=10];
     img [label="img\n(B,1,H,W)", style="filled", fillcolor=lightblue];
     flt [label="float_inputs\n(B,F)", style="filled", fillcolor=lightblue];
     prep [label="resize / RGB /\nprocessor norm"];
     vit [label="HF AutoModel\n(CLS token)"];
     fmlp [label="Float MLP +\nLinear → hidden"];
     cat [label="Concat\n(B, 2·H)"];
     trk [label="Trunk + heads\n(same idea as CNN PPO)"];
     img -> prep -> vit -> cat;
     flt -> fmlp -> cat;
     cat -> trk;
   }

Pixels are interpolated to the processor's height/width, duplicated to 3 channels if needed, mapped from ``[-1,1]`` to ``[0,1]``, then normalized with the processor's mean/std when available. Float features use the same two-layer MLP as the CNN variant, followed by a linear projection to the backbone's **hidden_size**, so that the **image CLS** and **float** embeddings can be concatenated before the shared trunk.

Variant C: Multimodal fusion (``nn.fusion_mode ≠ none``)
--------------------------------------------------------

When ``nn.fusion_mode`` is one of ``vision_transformer``, ``post_concat``, or ``unified``, ``ppo_wiring.make_network`` builds ``TorchMultimodalActorCritic`` (``multimodal_torch_fusion.py``). The **multimodal bundle** exposed as ``get_config().transformers`` combines ``nn.fusion_mode``, ``nn.init_from_pretrained``, and ``nn.encoder.transformer`` (plus ``nn.vis.transformer`` for the image side).

**Common after fusion:** a ``Linear → ReLU → Linear → ReLU`` trunk of width ``nn.decoder.dense_hidden_dimension`` (same config field name as IQN; PPO reads it as ``dense_hidden_dim``), then policy / value linear heads.

**Float width:** ``float_hidden_dim_effective()`` = ``nn.encoder.mlp.hidden_dim`` if set, else ``nn.float.mlp.hidden_dim``.

Sub-modes
~~~~~~~~~

``vision_transformer``
   **Image →** either (a) **native** ``PatchEmbed2d`` + ``nn.TransformerEncoder`` on patch tokens + mean-pool, using ``nn.vis.transformer`` (``d_model``, ``n_layers``, ``n_heads``, ``ff_mult``, ``dropout``, ``patch_size``), or (b) an **HF** backbone (CLS) + optional ``vis_refine`` encoder when ``use_hf_backbone: true`` (requires ``transformers``). **Float →** two-layer MLP. **Fusion:** concat(image_emb, float_emb) → ``bridge`` Linear to ``dense_hidden_dim`` → trunk.

``post_concat``
   **Image →** if ``use_image_head`` and the vision branch is **CNN**, ``_build_img_head`` with flags from ``nn.vis.cnn`` (IMPALA / adaptive pool / spectral norm as configured). Native patch or HF vision uses ``nn.vis.transformer`` instead. **Float →** ``fused_vector`` layout: two-layer MLP (width ``float_hidden_dim_effective()``), then concat with the vision vector and projection to a token sequence (length ``nn.encoder.transformer.post_concat_seq_len``). The ``token_sequence`` layout (e.g. ``float_token_layout: per_feature``) uses raw float tokens without that MLP. Then learned positions, the **fusion** ``nn.TransformerEncoder`` from ``nn.encoder.transformer`` (when not ``linear``), pool → ``bridge`` → trunk. **Hub round-trip:** fusion ``save_pretrained`` / ``from_pretrained`` JSON may include ``rulka_transformers.vis_cnn`` (a dump of ``nn.vis.cnn``) so CNN stems match after reload; older hubs without ``vis_cnn`` fall back to the baseline 4-conv kwargs for the CNN branch.

``unified``
   A single joint encoder over **image token(s)** and **learned float token(s)** (``unified_float_tokens``). **No** separate float MLP in this mode (raw float features go through ``float_to_tokens``). **Native** patch vision: ``vis.transformer.d_model`` must equal ``encoder.transformer.d_model`` (schema). **CNN** vision contributes **one** token; **HF** vision contributes **N** patch tokens (``N`` inferred from the HF backbone when the model is built). Fusion trunk: same ``fusion_encoder`` options as the other multimodal modes (default ``native_transformer`` when the encoder is not HF; else ``hf_embedding`` per ``infer_fusion_encoder``).

.. graphviz::

   digraph ppo_fusion_modes {
     rankdir=TB;
     node [shape=box, fontname="Helvetica", fontsize=10];
     m1 [label="vision_transformer:\nimage (patch or HF CLS) + float MLP\n→ concat → bridge"];
     m2 [label="post_concat:\nCNN ∥ float MLP → tokenize\n→ Enc_fusion → pool → bridge"];
     m3 [label="unified:\npatch tokens ∥ float tokens\n→ Enc_fusion → pool → bridge"];
     tr [label="Trunk + policy / value heads", style="filled", fillcolor=lightgreen];
     m1 -> tr;
     m2 -> tr;
     m3 -> tr;
   }

**Patch geometry:** ``nn.vis.transformer.patch_size`` must divide ``H_downsized`` and ``W_downsized`` for native ``vision_transformer`` and ``unified``. ``post_concat`` ignores patch size on the image side (CNN stem).
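A worked instance of that divisibility rule (illustrative values, not project defaults):

.. code-block:: python

   # Native vision_transformer / unified: the patch grid must tile the downsized frame exactly.
   h_downsized, w_downsized, patch_size = 128, 160, 16
   assert h_downsized % patch_size == 0 and w_downsized % patch_size == 0
   n_patch_tokens = (h_downsized // patch_size) * (w_downsized // patch_size)  # 8 * 10 = 80 tokens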
Optional **warm start:** ``nn.init_from_pretrained`` (a Rulka fusion ``save_pretrained`` directory) is loaded after build; trust flags follow ``nn.encoder.transformer.trust_remote_code``.

Example YAML for **post_concat** + HF two-tower fusion: ``config_ppo_transformer.yaml``. For native ``vision_transformer`` (patch + Linear fuse, no HF), start from ``config_ppo.yaml`` and set ``nn.fusion_mode: vision_transformer`` with ``nn.vis.transformer.use_hf_backbone: false``. See :ref:`nn-yaml-reference` in :doc:`../configuration_guide`.

IQN vs PPO (same inputs, different heads)
-----------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 18 40 42

   * - Aspect
     - IQN (:doc:`iqn_architecture`)
     - PPO (this page)
   * - Output
     - Distributional **Q(s,a,τ)** via quantile embedding + dueling
     - **π(a|s)** logits + **V(s)**
   * - Training
     - Replay buffer, n-step, quantile Huber, target network
     - On-policy rollouts, **GAE**, clipped surrogate, no replay
   * - Exploration
     - ε-greedy / Boltzmann / NoisyNet (config)
     - Stochastic policy sample from **Categorical**

Training flow (high level) — ``training.algorithm: ppo`` only
--------------------------------------------------------------

The following loop runs in ``trackmania_rl.multiprocess.learner_ppo`` when the algorithm is **PPO**. It does **not** apply to DPO or GRPO (those learners reuse collectors and often the same rollout tensor builder, but substitute their own losses).

.. graphviz::

   digraph ppo_train {
     rankdir=TB;
     node [shape=box, fontname="Helvetica", fontsize=10];
     col [label="Collectors:\nPPOInferer\nnetwork(img, float)"];
     q [label="Rollout queues:\nlog p, V, states, actions"];
     rew [label="ppo_rewards:\nvectorized rewards +\npotential shaping (γΦ'−Φ)"];
     gae [label="learner_ppo:\nGAE → Â, R"];
     loss [label="PPO loss:\nclip + c_v·L_V − c_e·H"];
     col -> q -> rew -> gae -> loss;
   }

**Per-step training objective (schematic):** the learner minimizes a sum of the **clipped policy surrogate**, the **value error** (often with clipping against old values), and **negative entropy** (i.e. an entropy bonus). ``ppo_loss_components`` in ``trackmania_rl/agents/policy_optimization/ppo.py`` implements the algebra; exact coefficients and schedules come from ``ppo:`` in YAML (:ref:`ppo-config` in :doc:`../configuration_guide`).

.. graphviz::

   digraph ppo_loss_schematic {
     rankdir=TB;
     node [shape=box, fontname="Helvetica", fontsize=10];
     inp [label="batch:\nlog π_old, V_old, a, r, done\n+ forward: log π_θ, V_θ", style="filled", fillcolor=lightblue];
     rat [label="ratio r = exp(log π_θ − log π_old)"];
     clip [label="L_clip = min(r·Â, clip(r)·Â)"];
     lv [label="L_V: MSE or clipped value vs returns R"];
     ent [label="entropy H[π_θ]\n−c_e · mean(H)"];
     sum [label="loss = −L_clip + c_v·L_V − c_e·H\n(minimize)", style="filled", fillcolor=lightgreen];
     inp -> rat -> clip -> sum;
     inp -> lv -> sum;
     inp -> ent -> sum;
   }

Reading the loss diagram:

- **Clipped policy branch:** if the ratio :math:`r` moves outside :math:`[1-\varepsilon, 1+\varepsilon]`, the objective flattens — **so the update does not over-reward** actions that the new policy already exploits much more than the old one.
- **Value branch:** pulls :math:`V_\theta` toward returns (optionally clipped to old values) — **so the critic tracks** what will happen from each state, which in turn makes GAE's advantages more accurate.
- **Entropy branch:** subtracts mean entropy from the loss (i.e. **maximizes** entropy) — **so sampling stays diverse** early in training.
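The same algebra in plain PyTorch (a textbook rendering of the diagram above; the coefficients are illustrative, not the values used by ``ppo_loss_components``):

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def ppo_loss_sketch(logp_new, logp_old, adv, v_new, returns, entropy,
                       clip_eps=0.2, c_v=0.5, c_e=0.01):
       ratio = torch.exp(logp_new - logp_old)               # r = π_θ / π_old
       l_clip = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
       l_value = F.mse_loss(v_new, returns)                 # optionally clipped against old values
       return -l_clip + c_v * l_value - c_e * entropy.mean()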
**Advantage flow:** ``compute_gae`` consumes step rewards, ``V(s_t)`` at collection time, dones, and a bootstrap value at the rollout tail; it outputs per-step advantages :math:`\hat{A}_t` and returns :math:`R_t` for the value target.

.. graphviz::

   digraph ppo_gae_flow {
     rankdir=LR;
     node [shape=box, fontname="Helvetica", fontsize=10];
     r [label="rewards r_t", style="filled", fillcolor=lightblue];
     v [label="values V(s_t)", style="filled", fillcolor=lightblue];
     d [label="dones", style="filled", fillcolor=lightblue];
     gae [label="compute_gae\n(γ, λ)"];
     out [label="Â_t , R_t", style="filled", fillcolor=lightgreen];
     r -> gae;
     v -> gae;
     d -> gae;
     gae -> out;
   }

**Rollout payload (per episode chunk):** collectors enqueue Python lists the learner turns into GPU tensors. DPO/GRPO reuse the same keys for the shared builder (``policy_rollout_batch``).

.. graphviz::

   digraph ppo_rollout_payload {
     rankdir=LR;
     node [shape=box, fontname="Helvetica", fontsize=10];
     col [label="collector\nPPOInferer step", style="filled", fillcolor=lightyellow];
     f [label="frames[]"];
     sf [label="state_float[]"];
     a [label="actions[]"];
     lp [label="ppo_log_probs[]"];
     pv [label="ppo_values[]"];
     q [label="rollout_queue\n+ end_race_stats", style="filled", fillcolor=lightpink];
     col -> f -> q;
     col -> sf -> q;
     col -> a -> q;
     col -> lp -> q;
     col -> pv -> q;
   }

**Why every queue field matters:** ``frames`` / ``state_float`` reconstruct :math:`s_t`; ``actions`` are the labels for :math:`\log\pi(a|s)`; ``ppo_log_probs`` are :math:`\log\pi_{\mathrm{old}}` at collection time; ``ppo_values`` bootstrap GAE and the value loss. ``end_race_stats`` drives logging and some reward edge cases (e.g. finish flags).

1. **Collectors** run the compiled (or eager) actor-critic on CUDA; append ``ppo_log_probs``, ``ppo_values``, frames, ``state_float``, actions.
2. **Learner** aggregates rollouts until ``ppo.rollout_steps_per_update``, builds tensors on GPU, computes **rewards** aligned with IQN's dense + engineered terms (``reward_vectorized`` + fold; see ``rollout_rewards.py``).
3. **GAE** uses scheduled ``γ`` / ``λ`` (and optional ``ppo_*_schedule`` in config).
4. **Optimizer** updates the **same** network used in collectors; weights are copied to the shared inference copy under a lock.
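Step 3, roughly, as the standard backward recursion (the γ/λ defaults here are illustrative, not the scheduled values from the config):

.. code-block:: python

   import torch

   def gae_sketch(rewards, values, dones, bootstrap_value, gamma=0.99, lam=0.95):
       """rewards, values, dones: 1-D tensors over one rollout chunk (illustrative helper)."""
       advantages = torch.zeros_like(rewards)
       next_value, running = bootstrap_value, 0.0
       for t in reversed(range(rewards.shape[0])):
           not_done = 1.0 - dones[t]
           delta = rewards[t] + gamma * next_value * not_done - values[t]
           running = delta + gamma * lam * not_done * running
           advantages[t] = running
           next_value = values[t]
       returns = advantages + values          # value-loss targets R_t
       return advantages, returns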
Key design notes
----------------

- **Float inputs are always used** when ``float_input_dim > 0``; they are not auxiliary metadata. With image off, the policy is float-only.
- **Shared trunk** means gradients from policy and value both affect CNN/float representations (unless you freeze modules elsewhere).
- **Schedules:** learning rate and optional PPO coefficients can follow piecewise schedules on the global frame counter (see configuration guide).

Implementation references
-------------------------

- ``trackmania_rl/agents/policy_models/ppo_actor_critic.py`` — CNN PPO network.
- ``trackmania_rl/agents/policy_models/multimodal_torch_fusion.py`` — native Transformer multimodal fusion (``nn.fusion_mode`` / ``get_config().transformers``); IQN reuses it without policy heads.
- ``trackmania_rl/nn_build/vis_cnn_head.py`` — kwargs for ``_build_img_head`` from ``nn.vis.cnn`` (IQN, PPO CNN, multimodal CNN branch, BC, pretrain Level 0 when ``rl_config_path`` is used).
- ``trackmania_rl/agents/policy_models/hf_actor_critic.py`` — HF backbone PPO.
- ``trackmania_rl/agents/algorithms/ppo_wiring.py`` — factory, ``PPOInferer``, compile warmup hook.
- ``trackmania_rl/agents/policy_optimization/ppo.py`` — GAE, clipped loss.
- ``trackmania_rl/agents/policy_optimization/rollout_rewards.py`` — full TM rewards for PPO.
- ``trackmania_rl/reward_vectorized.py`` — shared dense reward + potentials.
- ``trackmania_rl/multiprocess/learner_ppo.py`` — PPO learner loop.
- ``trackmania_rl/multiprocess/collector_process.py`` — attaches ``PPOInferer`` for any policy-optimization algorithm (``ppo``, ``dpo``, ``grpo``).

See also
--------

- :doc:`grpo_architecture` — same policy network; group-relative trajectory training.
- DPO (preference learning, same network): :ref:`dpo-config` in :doc:`../configuration_guide`.
- :doc:`nn_topology_catalog` — full matrix of supported ``nn`` topologies.
- :doc:`iqn_architecture` — baseline value-based architecture.
- :doc:`btr_architecture` — IQN-only extras (not applied to the PPO CNN factory).
- :doc:`../configuration_guide` — ``ppo:``, ``training:``, ``nn:``.