GRPO: network and training

This page documents GRPO (group-relative policy optimization) when training.algorithm is grpo: the same discrete shared-trunk actor-critic as PPO (PPO actor-critic architecture), with a trajectory-level objective based on group-relative returns instead of PPO’s per-step GAE and clipped ratio.

Implementation: trackmania_rl.agents.policy_optimization.grpo (advantages and policy objective), trackmania_rl.multiprocess.learner_grpo (learner loop). trackmania_rl.agents.algorithms.registry maps "grpo" to trackmania_rl.agents.algorithms.ppo_wiring, so GRPO uses the same network factory and PPOInferer rollout path as PPO.

YAML lives under grpo: (exposed as flat grpo_* on get_config()); see GRPO configuration (grpo:) in the Configuration Guide. Reference config: config_files/rl/config_grpo.yaml.
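For orientation, the flat names read off get_config() like this (a hedged sketch; see the reference config for the real keys and values):

   cfg = get_config()               # loads the active YAML, e.g. config_grpo.yaml
   K = cfg.grpo_group_size          # trajectories per group update
   epochs = cfg.grpo_update_epochs  # inner Adam passes per group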

What GRPO is doing here (why each idea)

Reuse ``ppo_wiring`` and the same actor-critic. You keep one vision + float design (Variants A/B/C on PPO actor-critic architecture) and the same checkpoints / collector contract. Only the learner changes: no GAE, no PPO clip — so you can experiment with group-based credit assignment without redefining the network.

Trajectory scalar :math:`R_i`. Each rollout is turned into per-step rewards (same dense + engineered shaping as PPO). Summing them gives one number per trajectory segment. Why: GRPO compares whole chunks of behavior (e.g. how far you got on the map in that run), not individual timesteps, so the signal is aligned with “this run was good/bad relative to other runs collected now.”
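As a minimal PyTorch sketch, using the ``rewards`` tensor of shape ``(T,)`` that the batch builder below produces:

   import torch

   rewards = torch.tensor([0.1, 0.0, 0.4, -0.2])  # per-step rewards for one trajectory segment
   R_i = rewards.sum()                            # the single scalar compared within the group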

Wait for ``grpo_group_size`` valid batches. Short or malformed segments are dropped (same minimum length as the shared tensor builder). Why: advantages are defined only within a fixed group; partial groups would bias which trajectories enter training.
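Schematically, the buffering loop looks like this (variable names are illustrative, not the learner's actual ones):

   group = []
   while len(group) < grpo_group_size:
       rollout = rollout_queue.get()                  # one trajectory from a collector
       batch = build_policy_rollout_tensors(rollout)  # None for short/malformed segments
       if batch is None:
           continue                                   # dropped, does not count toward the group
       group.append(batch)
   # exactly grpo_group_size valid trajectories from here on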

Group-relative advantages :math:`A_i`. Subtract the group mean (and optionally scale by group std) so \(\sum_i A_i = 0\). Why: you learn from which trajectory beat the others in this batch, not from absolute return scale (which drifts with reward schedules and map difficulty). Better-than-average runs get positive \(A_i\) and are reinforced; worse-than-average are discouraged.

Policy term :math:`-A_i \sum_t \log\pi(a_t \mid s_t)`. Classic REINFORCE on the full trajectory, weighted by \(A_i\). Why no PPO ratio: data are always on-policy for the current \(\theta\) inside each inner epoch; the code recomputes \(\log\pi\) with the live policy, so there is no stale behavior policy to correct with a ratio.

Recompute :math:`\log\pi` for ``grpo_update_epochs`` passes. Why: multiple Adam steps on the same \(K\) trajectories extract more from expensive env interaction, similar in spirit to PPO epochs — but still without a clip, so grpo_max_grad_norm and moderate learning rates matter.

Entropy term. Same role as in PPO: encourage stochasticity so the policy does not collapse to a single action mode too early.

Optional ``ref_policy`` + ``grpo_ref_kl_coef``. A frozen copy (periodically synced) evaluates \(\log\pi_{\mathrm{ref}}\). Why: penalize deviation from a reference snapshot or slowly moving anchor — useful if you want conservative updates or started from a strong prior. If the coefficient is 0, no extra forward passes on the reference.
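A sketch of the reference-policy bookkeeping, assuming standard PyTorch deepcopy/eval semantics (details in the learner may differ):

   import copy

   ref_policy = copy.deepcopy(policy).eval()  # frozen snapshot of the trainable policy
   for p in ref_policy.parameters():
       p.requires_grad_(False)                # gradients never reach the reference

   # every grpo_ref_sync_every_updates group updates, move the anchor forward:
   ref_policy.load_state_dict(policy.state_dict())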

Value head unused in the loss. Still computed at collection so queues and build_policy_rollout_tensors stay one code path with PPO/DPO. Why not strip it: less duplication and easier switching between ppo, dpo, and grpo; only the learner ignores \(V\) for GRPO’s objective.

Shared network + lock after each group update. Collectors must act with weights the learner just produced. Why: same real-time sync story as PPO — parallel envs, single source of truth for inference weights.
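Conceptually (the lock name is illustrative; the shared module is the uncompiled_shared_network named in the process diagram further down):

   with shared_network_lock:  # collectors read inference weights under the same lock
       uncompiled_shared_network.load_state_dict(policy.state_dict())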

Algorithm placement (same code paths as PPO)

GRPO does not introduce a new nn topology: config → registry → ``ppo_wiring`` → ``make_network`` builds the same classes as PPO (Variant A/B/C on PPO actor-critic architecture). Only ``learner_grpo`` replaces ``learner_ppo``.
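In code terms the dispatch is roughly (names from the diagram below; the exact make_network signature is not shown on this page):

   from trackmania_rl.agents.algorithms import registry

   wiring = registry.get_wiring("grpo")  # resolves to the ppo_wiring module
   network = wiring.make_network()       # PpoActorCritic | HfActorCritic | TorchMultimodalActorCritic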

digraph grpo_code_stack {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   yaml [label="YAML:\ntraining.algorithm = grpo\n+ grpo: block", style="filled", fillcolor=lightcyan];
   reg [label="registry.get_wiring(\"grpo\")\n→ ppo_wiring module"];
   mk [label="ppo_wiring.make_network()\nPpoActorCritic | HfActorCritic |\nTorchMultimodalActorCritic", style="filled", fillcolor=lightgreen];
   col [label="collector_process\nPPOInferer"];
   lr [label="learner_grpo.py\n(group loss)", style="filled", fillcolor=lightyellow];
   yaml -> reg -> mk;
   mk -> col;
   mk -> lr;
}

Training stack (processes)

Same multiprocess layout as PPO: collectors fill queues; one learner process consumes rollouts. The learner holds the trainable policy, a frozen ref_policy copy (optional KL term), and syncs weights into uncompiled_shared_network after each group update.

Why a separate ``ref_policy`` node in the diagram: KL regularization needs two forwards — trainable \(\pi_\theta\) and fixed \(\pi_{\mathrm{ref}}\) — without mixing gradients into the reference. Periodic load_state_dict from the live policy decides how “stale” the anchor is.

digraph grpo_process_stack {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   train [label="scripts/train.py", style="rounded,filled", fillcolor=lightcyan];
   lp [label="learner_process.py\nalgorithm == grpo → learner_grpo", style="filled", fillcolor=lightyellow];
   cp [label="collector_process.py × N"];
   inf [label="PPOInferer\n(forward + sample + log p, V)"];
   lgr [label="learner_grpo.py\nK rollouts → advantages → loss"];
   ref [label="ref_policy\n(deepcopy, eval, no grad)\noptional KL", style="filled", fillcolor=lightsteelblue];
   pol [label="policy (trainable)", style="filled", fillcolor=lightgreen];
   sh [label="uncompiled_shared_network\n+ lock", style="filled", fillcolor=lightpink];
   q [label="rollout_queues"];
   train -> lp;
   train -> cp;
   cp -> inf -> pol;
   inf -> q [label="put"];
   q -> lgr;
   lp -> lgr;
   lgr -> pol;
   lgr -> ref [style=dashed, label="forward no grad"];
   lgr -> sh [label="load_state_dict"];
   inf -> sh [style=dashed, label="inference weights"];
}

Policy network (identical to PPO)

All tensor routing (image + float → trunk → logits + \(V\)) is on PPO actor-critic architecture. The conceptual forward at collection and training is:

digraph grpo_forward {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   img [label="obs_img\n(T,1,H,W)", style="filled", fillcolor=lightblue];
   fl [label="obs_float\n(T,F)", style="filled", fillcolor=lightblue];
   act [label="actions\n(from rollout)", style="filled", fillcolor=lightblue];
   trunk [label="shared trunk\n(same as PPO)", style="filled", fillcolor=lightyellow];
   ev [label="evaluate_actions\n(img, float, actions)", style="filled", fillcolor=wheat];
   out [label="log p, entropy, V\nGRPO: sum log p over T;\nV unused in loss", style="filled", fillcolor=lightgreen];
   vnote [label="value head still runs\n(grad flows only via\npolicy + entropy paths)", shape=note, fontsize=9];
   img -> trunk;
   fl -> trunk;
   trunk -> ev;
   act -> ev;
   ev -> out;
   ev -> vnote [style=dotted];
}

Collection: collectors still store ppo_values for parity with the tensor builder; GRPO ignores them in the objective. Entropy can be averaged per trajectory inside the learner when stacking the group.
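For one trajectory of length \(T\), the reductions the learner needs are one-liners, assuming the (log p, entropy, V) return order shown in the forward diagram:

   logp, entropy, values = policy.evaluate_actions(obs_img, obs_float, actions)  # each (T,)
   traj_logp = logp.sum()     # sum_t log pi(a_t | s_t), enters the policy term
   traj_ent = entropy.mean()  # per-trajectory entropy for the group average
   # `values` is ignored by the GRPO objective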

Rollout → GPU batch (one trajectory)

build_policy_rollout_tensors (policy_rollout_batch.py) aligns frames, state_float, actions, ppo_log_probs, and ppo_values, then calls ppo_rewards_and_dones_from_rollout for per-step rewards (dense + engineered, same as PPO). Invalid segments (too few steps) return None and are dropped.

digraph grpo_rollout_batch {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   env [label="Env rollout dict:\nframes[], state_float[],\nactions[], ppo_log_probs[],\nppo_values[]", style="filled", fillcolor=lightblue];
   build [label="build_policy_rollout_tensors"];
   t [label="GPU tensors:\nobs_img, obs_float, actions\nrewards (T,), dones\nold_logp, old_values", style="filled", fillcolor=lightyellow];
   R [label="R_i = sum_t rewards[t]\n(scalar per trajectory)", style="filled", fillcolor=lightgreen];
   env -> build -> t -> R;
}

Forming a group and group-relative advantages

The learner buffers valid batches until it has exactly grpo_group_size trajectories \(\tau_1,\ldots,\tau_K\). Each has a scalar return \(R_i = \sum_t r_{i,t}\). Advantages are detached and zero-mean across the group:

  • mean: \(A_i = R_i - \frac{1}{K}\sum_j R_j\).

  • mean_std: center, then divide by group std (with stabilizer). Why ``mean_std``: when absolute return spread changes a lot across training, scaling keeps gradient scale more stable than centering alone.

digraph grpo_group_adv {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   b1 [label="batch τ_1 → R_1", style="filled", fillcolor=lightblue];
   b2 [label="batch τ_2 → R_2", style="filled", fillcolor=lightblue];
   bk [label="batch τ_K → R_K", style="filled", fillcolor=lightblue];
   dot [label="...", shape=plaintext];
   grp [label="stack [R_1..R_K]"];
   adv [label="group_relative_advantages\nmean | mean_std", style="filled", fillcolor=lightyellow];
   out [label="A_1..A_K\n(detached)", style="filled", fillcolor=lightgreen];
   b1 -> grp;
   b2 -> grp;
   dot -> grp;
   bk -> grp;
   grp -> adv -> out;
}
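A minimal sketch of the centering step, assuming the mean / mean_std modes described above (the 1e-8 stabilizer is an assumed value):

   import torch

   def group_relative_advantages(returns: torch.Tensor, mode: str = "mean") -> torch.Tensor:
       """(K,) trajectory returns R_1..R_K → detached, zero-mean A_1..A_K."""
       adv = returns - returns.mean()
       if mode == "mean_std":
           adv = adv / (returns.std() + 1e-8)  # keeps tiny spreads from exploding gradients
       return adv.detach()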

Policy loss (inner epochs)

For each grpo_update_epochs pass, the learner recomputes \(\log\pi_\theta(\tau_i)=\sum_t \log\pi_\theta(a_{i,t}\mid s_{i,t})\) with the current \(\theta\). The policy term is \(\mathcal{L}_\pi = -\frac{1}{K}\sum_i A_i \log\pi(\tau_i)\). Optional reference term uses ref_policy with torch.no_grad() on the reference branch; ref_policy is refreshed from the live policy every grpo_ref_sync_every_updates group updates.

digraph grpo_loss_detail {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   batches [label="K batches\n(obs, actions)", style="filled", fillcolor=lightblue];
   subgraph cluster_pi {
      label="Trainable policy (each traj i)";
      style=dashed;
      e1 [label="policy.evaluate_actions\n→ logp (T,), ent"];
      s1 [label="traj_logp_i = sum_t logp"];
      e1 -> s1;
   }
   subgraph cluster_ref {
      label="Optional ref (grpo_ref_kl_coef > 0)";
      style=dashed;
      e2 [label="ref_policy.evaluate_actions\n(no grad)"];
      kl [label="mean over steps of\n(log pi_theta - log pi_ref)"];
      e2 -> kl;
   }
   batches -> e1;
   batches -> e2;
   advn [label="A_i (detached)", style="filled", fillcolor=lightcyan];
   lpi [label="L_pi = -mean_i A_i * traj_logp_i"];
   le [label="- grpo_ent_coef * mean(entropy)"];
   lkl [label="+ grpo_ref_kl_coef * KL term"];
   tot [label="total loss -> backward\nGradScaler + clip_grad_norm\nAdam step", style="filled", fillcolor=lightgreen];
   advn -> lpi;
   s1 -> lpi;
   lpi -> tot;
   le -> tot;
   kl -> lkl -> tot;
}
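Putting the pieces together, a hedged sketch of one group update (names follow this page; the batch attributes are assumptions and the GradScaler/AMP plumbing is elided):

   import torch

   for _ in range(grpo_update_epochs):
       total_loss = 0.0
       for batch, A_i in zip(group, advantages):  # K trajectories, A_i detached
           logp, entropy, _ = policy.evaluate_actions(batch.obs_img, batch.obs_float, batch.actions)
           loss_i = -A_i * logp.sum() - grpo_ent_coef * entropy.mean()
           if grpo_ref_kl_coef > 0:
               with torch.no_grad():
                   ref_logp, _, _ = ref_policy.evaluate_actions(batch.obs_img, batch.obs_float, batch.actions)
               loss_i = loss_i + grpo_ref_kl_coef * (logp - ref_logp).mean()
           total_loss = total_loss + loss_i / len(group)
       optimizer.zero_grad()
       total_loss.backward()
       torch.nn.utils.clip_grad_norm_(policy.parameters(), grpo_max_grad_norm)
       optimizer.step()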

End-to-end training loop (summary)

digraph grpo_train {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   col [label="Collectors:\nPPOInferer\n(same as PPO)"];
   q [label="Rollout queues:\nframes, float, actions,\nlog p, V"];
   b [label="build_policy_rollout_tensors\n(step rewards, dones)"];
   buf [label="Buffer until K valid\n(grpo_group_size)"];
   adv [label="group_relative_advantages\non R_i = sum r"];
   ep [label="grpo_update_epochs ×\nforward + loss + step"];
   sync [label="sync ref_policy\nevery N updates"];
   sh [label="shared network\nfor collectors"];
   col -> q -> b -> buf -> adv -> ep -> sh;
   ep -> sync [style=dashed];
}

PPO vs GRPO (architecture vs credit assignment)

Architecture (CNN / HF / fusion, two inputs, two heads) is the same; only the learner’s use of outputs differs.

digraph ppo_vs_grpo {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   subgraph cluster_ppo {
      label="PPO learner";
      style=filled;
      fillcolor="#f0f8ff";
      p1 [label="Many steps in rollout buffer"];
      p2 [label="GAE per timestep\nÂ_t, R_t"];
      p3 [label="Clipped ratio + value loss + H"];
      p1 -> p2 -> p3;
   }
   subgraph cluster_grpo {
      label="GRPO learner";
      style=filled;
      fillcolor="#fff8f0";
      g1 [label="K full trajectories"];
      g2 [label="Scalar R_i per traj\nA_i within group"];
      g3 [label="−A_i Σ log π + H\n(+ optional ref)"];
      g1 -> g2 -> g3;
   }
   net [label="Same actor-critic\n(ppo_wiring)", style="filled", fillcolor=lightgreen];
   net -> p1;
   net -> g1;
}

Aspect            | PPO (PPO actor-critic architecture)              | GRPO (this page)
Credit assignment | GAE on step rewards with \(\gamma\), \(\lambda\) | Scalar return \(R_i\) per trajectory; group-relative \(A_i\)
Policy objective  | Clipped importance ratio vs behavior policy      | REINFORCE-style \(-A_i \sum_t \log\pi(a_t|s_t)\) (no ratio clip)
Value head        | Used (value loss + bootstrap for GAE)            | Not used in the loss (still computed at collection)
Batch shape       | ppo.rollout_steps_per_update then minibatches    | Exactly grpo_group_size trajectories per update

Implementation references

  • trackmania_rl/agents/policy_optimization/grpo.py — group centering and policy objective.

  • trackmania_rl/multiprocess/learner_grpo.py — GRPO learner loop, reference policy, TensorBoard.

  • trackmania_rl/multiprocess/policy_rollout_batch.py — build_policy_rollout_tensors, grpo_scheduled_float.

  • trackmania_rl/agents/algorithms/ppo_wiring.py — network factory (shared with PPO/DPO).

  • trackmania_rl/multiprocess/collector_process.py — policy-optimization collectors (PPO/DPO/GRPO).

  • config_files/config_schema.py — GRPOConfig.

See also