Model architectures
This section describes what tensors flow where for IQN and for policy
optimization (PPO, DPO, GRPO — the last two reuse the PPO actor-critic
and ppo_wiring). YAML knobs are documented in Neural network YAML (nn) — full reference and
in the BTR block (btr:) section of the Configuration Guide. For a tabular catalog
of every supported stack (fusion modes, vision branches, fusion trunks, IQN
decoder, pitfalls), see NN topology catalog (supported stacks).
Which stack when?
| You set | Algorithm | Architecture page / module |
|---|---|---|
| | IQN | IQN architecture — |
| | IQN | IQN architecture — |
| | IQN | IQN architecture — |
| | PPO | PPO actor-critic architecture — |
| | PPO | PPO actor-critic architecture — |
| | PPO | PPO actor-critic architecture — |
| | DPO | PPO actor-critic architecture — same built modules; training / pairs under DPO configuration (dpo:) in Configuration Guide. |
| | GRPO | GRPO: network and training — same modules as PPO; PPO actor-critic architecture for tensor routing diagrams. |
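As context for the IQN rows above: IQN's decoder conditions the value estimate on sampled quantiles via a cosine embedding, which the learned iqn_fc layer then projects into the trunk's hidden width ahead of the dueling heads. A minimal sketch of that embedding (the dimension and layout here are assumptions for illustration, not this repo's exact code):

```python
import math

def quantile_embedding(tau: float, dim: int = 64) -> list[float]:
    """Cosine features for one quantile sample tau in (0, 1).

    IQN embeds tau as cos(pi * i * tau) for i = 0..dim-1; a learned
    linear layer (the iqn_fc module referenced on this page) then maps
    these features into the network's hidden width before the dueling
    value/advantage heads.
    """
    return [math.cos(math.pi * i * tau) for i in range(dim)]

# Each sampled tau yields a distinct feature vector, so one forward
# pass can evaluate many quantiles of the return distribution.
features = quantile_embedding(0.5, dim=64)
```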
- IQN: distributional off-policy (quantile Q, replay, target net).
- PPO: on-policy actor-critic (logits + V, GAE, clipped objective, no replay).
- DPO: keeps that actor-critic but trains from preference pairs (chosen vs rejected trajectories; DPO configuration (dpo:)).
- GRPO: on-policy with the same actor-critic, but group-relative trajectory returns and no PPO clip (see GRPO: network and training).

Multimodal IQN and PPO (and DPO / GRPO) share the same fusion body when nn.fusion_mode matches; IQN swaps the policy/value heads for iqn_fc + dueling heads. PPO/DPO/GRPO never use IQN's target-net slot weights2.torch.
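The group-relative credit assignment that sets GRPO apart from PPO fits in a few lines. The z-score normalization below is the common formulation; treat it as a sketch rather than this repo's exact implementation (see GRPO: network and training for that):

```python
import statistics

def group_relative_advantages(returns: list[float]) -> list[float]:
    """GRPO-style credit assignment: z-score each trajectory's total
    return against its group, so no learned value baseline (critic
    output) is needed for the advantage estimate."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    if std == 0.0:  # degenerate group: every trajectory scored the same
        return [0.0 for _ in returns]
    return [(r - mean) / std for r in returns]

# Trajectories that beat their group's mean get positive advantage,
# the rest negative; the advantages of a group always sum to zero.
advs = group_relative_advantages([1.0, 2.0, 3.0])
```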
BTR is not a separate training.algorithm: it is a set of optional flags on top of IQN (same IQN_Network). See BTR options (IQN + paper extras).
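A minimal configuration sketch of that layering; everything under btr: below is a placeholder comment, since the real keys are documented in BTR options (IQN + paper extras):

```yaml
# Sketch only: BTR rides on IQN rather than replacing it.
training:
  algorithm: iqn   # there is no "btr" value here; IQN_Network is still built
btr:
  # ...enable the paper-extra flags you want (see BTR options page)
```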
Reference configs (config_files/rl/): config_default.yaml / config_btr.yaml (classic IQN); config_btr_post_concat_cnn_transformer.yaml (BTR + multimodal post_concat + CNN + fusion transformer). config_ppo.yaml and siblings define nn layouts usable for multimodal IQN if you set training.algorithm: iqn; config_dpo.yaml / config_grpo.yaml mirror the PPO stack with dpo: / grpo: blocks.
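Reusing a PPO nn layout under IQN, as described above, only requires flipping training.algorithm. A hypothetical helper (override is not part of this repo) showing the idea on an in-memory config dict:

```python
def override(config: dict, path: str, value) -> dict:
    """Set a dotted key like 'training.algorithm' in a nested config
    dict, creating intermediate mappings as needed."""
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

# Start from a PPO-style layout and retarget it at IQN: the nn block
# (fusion body, vision branch, ...) is untouched; only the algorithm
# selection changes, so the same stack is rebuilt with IQN heads.
cfg = {"training": {"algorithm": "ppo"}, "nn": {"fusion_mode": "post_concat"}}
override(cfg, "training.algorithm", "iqn")
```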
Contents
- NN topology catalog (supported stacks)
- IQN architecture
- PPO actor-critic architecture
- Why this stack is shaped this way
- Routing: which network is built?
- Training stack (processes and modules)
- Overview
- Variant A: CNN actor-critic
- Variant B: Hugging Face vision backbone
- Variant C: Multimodal fusion (nn.fusion_mode ≠ none)
- IQN vs PPO (same inputs, different heads)
- Training flow (high level) — training.algorithm: ppo only
- Key design notes
- Implementation references
- See also
- GRPO: network and training
- What GRPO is doing here (why each idea)
- Algorithm placement (same code paths as PPO)
- Training stack (processes)
- Policy network (identical to PPO)
- Rollout → GPU batch (one trajectory)
- Forming a group and group-relative advantages
- Policy loss (inner epochs)
- End-to-end training loop (summary)
- PPO vs GRPO (architecture vs credit assignment)
- Implementation references
- See also
- BTR options (IQN + paper extras)