Model architectures

This section describes what tensors flow where for IQN and for policy optimization (PPO, DPO, GRPO — the last two reuse the PPO actor-critic and ppo_wiring). YAML knobs live in Neural network YAML (nn) — full reference and BTR block (btr:) (Configuration Guide). For a tabular catalog of every supported stack (fusion modes, vision branches, fusion trunks, IQN decoder, pitfalls), see NN topology catalog (supported stacks).

Which stack when?

| You set | Algorithm | Architecture page / module |
| --- | --- | --- |
| training.algorithm: iqn, nn.fusion_mode: none, nn.vis.cnn or no_image | IQN | IQN architecture: IQN_Network (iqn.py); optional BTR options (IQN + paper extras). |
| training.algorithm: iqn, nn.fusion_mode: none, nn.vis.transformer.use_hf_backbone: true | IQN | IQN architecture: IQNSharedBackboneNetwork + headless HfActorCritic (hf_actor_critic.py). |
| training.algorithm: iqn, nn.fusion_mode in vision_transformer / post_concat / unified | IQN | IQN architecture: IQNSharedBackboneNetwork + headless TorchMultimodalActorCritic (multimodal_torch_fusion.py). |
| training.algorithm: ppo, nn.fusion_mode: none, nn.vis.cnn (or no_image) | PPO | PPO actor-critic architecture: PpoActorCritic (ppo_actor_critic.py). |
| training.algorithm: ppo, nn.fusion_mode: none, nn.vis.transformer.use_hf_backbone: true | PPO | PPO actor-critic architecture: HfActorCritic (hf_actor_critic.py); needs the extra pip install -e ".[policy]". |
| training.algorithm: ppo, nn.fusion_mode in vision_transformer / post_concat / unified | PPO | PPO actor-critic architecture: TorchMultimodalActorCritic (multimodal_torch_fusion.py). |
| training.algorithm: dpo, same nn choices as PPO (CNN / HF / fusion) | DPO | PPO actor-critic architecture: same built modules; training and preference pairs under DPO configuration (dpo:) in the Configuration Guide. |
| training.algorithm: grpo, same nn choices as PPO (CNN / HF / fusion) | GRPO | GRPO: network and training (same modules as PPO); see PPO actor-critic architecture for the tensor routing diagrams. |
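For concreteness, here is a minimal YAML sketch of two rows from the table. The keys are the ones named above; the exact nesting is illustrative and the authoritative layout is the Neural network YAML (nn) reference in the Configuration Guide.

```yaml
# Row 1: classic IQN with a CNN vision branch (illustrative nesting).
training:
  algorithm: iqn
nn:
  fusion_mode: none
  vis:
    cnn: {}                  # CNN branch; alternatively run with no_image
---
# Row 5: PPO on a HuggingFace backbone; needs pip install -e ".[policy]".
training:
  algorithm: ppo
nn:
  fusion_mode: none
  vis:
    transformer:
      use_hf_backbone: true
```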

IQN is distributional and off-policy (quantile Q-values, a replay buffer, a target network). PPO is on-policy actor-critic (policy logits plus a value head, GAE advantages, a clipped surrogate objective, no replay). DPO keeps that actor-critic but trains from preference pairs (chosen vs. rejected trajectories; see DPO configuration (dpo:)). GRPO is on-policy with the same actor-critic but uses group-relative trajectory returns and drops the PPO clip (see GRPO: network and training). Multimodal IQN and PPO (and DPO / GRPO) share the same fusion body when nn.fusion_mode matches; IQN swaps the policy/value heads for iqn_fc plus dueling heads. PPO/DPO/GRPO never use IQN's target-network weight slot (weights2.torch).
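To make that sharing concrete, a sketch (illustrative nesting, keys as named above) of one fusion body serving every algorithm:

```yaml
# One fusion body, several algorithms: only training.algorithm changes.
training:
  algorithm: iqn             # or ppo / dpo / grpo over the identical nn: block
nn:
  fusion_mode: post_concat   # builds the TorchMultimodalActorCritic fusion body
                             # (headless under IQN, with iqn_fc + dueling heads on top)
```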

BTR is not a separate training.algorithm: it is optional flags on top of IQN (same IQN_Network). See BTR options (IQN + paper extras).
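A minimal sketch of that layering; the contents of btr: are deliberately left as comments because the real flag names live in BTR options (IQN + paper extras) and in config_btr.yaml:

```yaml
# BTR rides on IQN: the algorithm stays iqn, extras are switched on under btr:.
training:
  algorithm: iqn             # there is no training.algorithm: btr
btr:
  # paper-extra flags go here; see BTR options (IQN + paper extras)
  # and config_btr.yaml for the actual keys
```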

Reference configs (config_files/rl/):

- config_default.yaml and config_btr.yaml: classic IQN.
- config_btr_post_concat_cnn_transformer.yaml: BTR + multimodal post_concat + CNN + fusion transformer.
- config_ppo.yaml and siblings: PPO; their nn layouts are also usable for multimodal IQN if you set training.algorithm: iqn (sketch below).
- config_dpo.yaml / config_grpo.yaml: mirror the PPO stack with dpo: / grpo: blocks.
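For example (a sketch, not a shipped file), repurposing a PPO layout for multimodal IQN is a one-key change:

```yaml
# Start from config_ppo.yaml (or a sibling) and change only the algorithm;
# the multimodal nn: block is reused unchanged.
training:
  algorithm: iqn             # was: ppo
# nn:
#   ...keep the layout from config_ppo.yaml as-is
```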

Contents