Model architectures

This section describes what tensors flow where for IQN and for policy optimization (PPO, DPO, GRPO — the last two reuse the PPO actor-critic and ppo_wiring). YAML knobs live in Neural network YAML (nn) — full reference and BTR block (btr:) (Configuration Guide). For a tabular catalog of every supported stack (fusion modes, vision branches, fusion trunks, IQN decoder, pitfalls), see NN topology catalog (supported stacks).

Which stack when?

| You set | Algorithm | Architecture page / module |
| --- | --- | --- |
| training.algorithm: iqn, nn.fusion_mode: none, nn.vis.cnn or no_image | IQN | IQN architecture: IQN_Network (iqn.py); optional BTR options (IQN + paper extras). |
| training.algorithm: iqn, nn.fusion_mode: none, nn.vis.transformer.use_hf_backbone: true | IQN | IQN architecture: IQNSharedBackboneNetwork + headless HfActorCritic (hf_actor_critic.py). |
| training.algorithm: iqn, nn.fusion_mode in vision_transformer / post_concat / unified | IQN | IQN architecture: IQNSharedBackboneNetwork + headless TorchMultimodalActorCritic (multimodal_torch_fusion.py). |
| training.algorithm: ppo, nn.fusion_mode: none, nn.vis.cnn (or no_image) | PPO | PPO actor-critic architecture: PpoActorCritic (ppo_actor_critic.py). |
| training.algorithm: ppo, nn.fusion_mode: none, nn.vis.transformer.use_hf_backbone: true | PPO | PPO actor-critic architecture: HfActorCritic (hf_actor_critic.py); needs the extra pip install -e ".[policy]". |
| training.algorithm: ppo, nn.fusion_mode in vision_transformer / post_concat / unified | PPO | PPO actor-critic architecture: TorchMultimodalActorCritic (multimodal_torch_fusion.py). |
| training.algorithm: dpo, same nn choices as PPO (CNN / HF / fusion) | DPO | PPO actor-critic architecture: same built modules; training and preference pairs under DPO configuration (dpo:) in the Configuration Guide. |
| training.algorithm: grpo, same nn choices as PPO (CNN / HF / fusion) | GRPO | GRPO: network and training (same modules as PPO); see PPO actor-critic architecture for the tensor routing diagrams. |
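For concreteness, here is a minimal YAML sketch of two rows from the table. The keys are the ones named above; the exact nesting is illustrative and the authoritative layout is the Neural network YAML (nn) reference in the Configuration Guide.

```yaml
# Row 1: classic IQN with a CNN vision branch (illustrative nesting).
training:
  algorithm: iqn
nn:
  fusion_mode: none
  vis:
    cnn: {}                  # CNN branch; alternatively run with no_image
---
# Row 5: PPO on a HuggingFace backbone; needs pip install -e ".[policy]".
training:
  algorithm: ppo
nn:
  fusion_mode: none
  vis:
    transformer:
      use_hf_backbone: true
```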

IQN is distributional and off-policy (quantile Q-values, a replay buffer, a target network). PPO is on-policy actor-critic (policy logits plus a value head, GAE advantages, a clipped surrogate objective, no replay). DPO keeps that actor-critic but trains from preference pairs (chosen vs. rejected trajectories; see DPO configuration (dpo:)). GRPO is on-policy with the same actor-critic but uses group-relative trajectory returns and drops the PPO clip (see GRPO: network and training). Multimodal IQN and PPO (and DPO / GRPO) share the same fusion body when nn.fusion_mode matches; IQN swaps the policy/value heads for iqn_fc plus dueling heads. PPO/DPO/GRPO never use IQN's target-network weight slot (weights2.torch).
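To make that sharing concrete, a sketch (illustrative nesting, keys as named above) of one fusion body serving every algorithm:

```yaml
# One fusion body, several algorithms: only training.algorithm changes.
training:
  algorithm: iqn             # or ppo / dpo / grpo over the identical nn: block
nn:
  fusion_mode: post_concat   # builds the TorchMultimodalActorCritic fusion body
                             # (headless under IQN, with iqn_fc + dueling heads on top)
```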

BTR is not a separate training.algorithm: it is optional flags on top of IQN (same IQN_Network). See BTR options (IQN + paper extras).
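A minimal sketch of that layering; the contents of btr: are deliberately left as comments because the real flag names live in BTR options (IQN + paper extras) and in config_btr.yaml:

```yaml
# BTR rides on IQN: the algorithm stays iqn, extras are switched on under btr:.
training:
  algorithm: iqn             # there is no training.algorithm: btr
btr:
  # paper-extra flags go here; see BTR options (IQN + paper extras)
  # and config_btr.yaml for the actual keys
```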

Reference configs (config_files/rl/):

- config_default.yaml and config_btr.yaml: classic IQN.
- config_btr_post_concat_cnn_transformer.yaml: BTR + multimodal post_concat + CNN + fusion transformer.
- config_ppo.yaml and siblings: PPO; their nn layouts are also usable for multimodal IQN if you set training.algorithm: iqn (sketch below).
- config_dpo.yaml / config_grpo.yaml: mirror the PPO stack with dpo: / grpo: blocks.
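For example (a sketch, not a shipped file), repurposing a PPO layout for multimodal IQN is a one-key change:

```yaml
# Start from config_ppo.yaml (or a sibling) and change only the algorithm;
# the multimodal nn: block is reused unchanged.
training:
  algorithm: iqn             # was: ppo
# nn:
#   ...keep the layout from config_ppo.yaml as-is
```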

Contents