.. _iqn_architecture:

IQN architecture
================

The diagrams on this page are rendered with **Graphviz** (the
``sphinx.ext.graphviz`` extension). The CI workflow that publishes the docs
installs Graphviz so the diagrams render on the site. For a local docs build,
install Graphviz and make sure ``dot`` is on your PATH, or you will see raw
DOT code instead of images.

This page describes the **structure** of **IQN_Network**
(``trackmania_rl.agents.iqn``), **how training works** (data flow, replay,
loss), and **why** the main design choices were made.

What we use
-----------

- **IQN** — value-based RL with a *distributional* head: we predict
  **quantiles** of the return (sum of future rewards), not just its
  expectation. For each (state, action) we get K values (one per quantile
  τ ∈ (0, 1)); their mean is the usual Q(s, a). This often improves sample
  efficiency and stability (see *Why distributional* below).
- **Discrete actions** — e.g. 12 classes (steer × accel × brake binned),
  defined in ``config_files/inputs_list.py`` and ``config.inputs``.
- **Inputs** — (1) **image**: one grayscale frame per step, downscaled
  (e.g. 64×64); (2) **float state**: scalar features (position along the
  track, zone indices, previous actions, etc.); see ``state_normalization``
  and ``float_input_dim`` in the config.

Training loop (data flow)
-------------------------

1. **Collectors** — Several game instances run in parallel. Each has an
   **inference_network** (a copy of the policy). The agent observes
   (frame, float_state), chooses an action (ε-greedy or Boltzmann over the
   mean Q), and sends it to the game. Transitions
   (state, action, reward, next_state, …) are sent to the learner via a
   queue.
2. **Learner** — One process. It holds the **online_network** (updated by
   gradients) and the **target_network** (periodically synced with the
   online network, e.g. a soft update with τ = 0.02 or a hard update every
   N steps). It also maintains an **uncompiled_shared_network**: the learner
   copies online → shared; collectors copy shared → their inference_network.
   Collectors therefore always use a slightly stale but consistent policy.
3. **Replay buffer** — Transitions are stored in a **ReplayBuffer** (e.g.
   prioritized). Sampling is done in **mini-batches**; each batch is then
   passed through **buffer_collate_function**, which implements the
   **mini-race** logic (see *Why mini-races* below).
4. **train_on_batch** — For each batch we compute TD targets using the
   **target** network and the current rewards/gammas; we compute Q(s, a) for
   the sampled (s, a) using the **online** network; we minimize the quantile
   Huber loss between targets and outputs, then backprop and update the
   online network.

**Why separate online and target?** This is standard in DQN: the target is
held fixed (or moved slowly) for many steps so the learning signal is stable;
otherwise we would be chasing a moving target (bootstrapping from a network
we keep changing).

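For reference, a minimal PyTorch sketch of the two sync schemes mentioned
above (``soft_update`` and ``hard_update`` are our names for illustration,
not the project's actual API):

.. code-block:: python

   import torch

   @torch.no_grad()
   def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.02) -> None:
       """Polyak averaging: target <- (1 - tau) * target + tau * online."""
       for p_o, p_t in zip(online.parameters(), target.parameters()):
           p_t.mul_(1.0 - tau).add_(p_o, alpha=tau)

   @torch.no_grad()
   def hard_update(online: torch.nn.Module, target: torch.nn.Module) -> None:
       """Full copy of the online weights, done every N learner steps."""
       target.load_state_dict(online.state_dict())
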
Why distributional (quantiles)
------------------------------

In standard DQN we learn a single number Q(s, a) = E[return]. In **IQN** we
learn the *distribution* of the return via its **quantiles**: for τ ∈ (0, 1)
we predict the τ-quantile (e.g. τ = 0.1 is pessimistic, τ = 0.5 is the
median). The network is trained with the **quantile Huber loss** so that the
predicted quantiles match the distribution of TD targets.

**Why this helps:**

(1) **Richer signal** — the full distribution captures risk and uncertainty.
(2) **More learning signal** — multiple quantiles provide more gradient
    signal per transition than a single scalar.
(3) **Stability** — distributional methods (IQN, QR-DQN, C51) often reduce
    overestimation and improve convergence.

We use **implicit** quantiles: τ is sampled (or fixed) per forward pass and
embedded via cos(π·i·τ); the state representation is repeated K times and
mixed with this embedding. Config: ``iqn_n`` (e.g. 8) quantiles during
training; ``iqn_k`` (e.g. 32) during inference for action selection (we
average over quantiles, then choose the argmax).

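As a reference, here is a minimal sketch of the quantile Huber loss (PyTorch;
shapes and names are ours — the project's ``train_on_batch`` additionally
applies prioritization weights and the loss scaling described further down):

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def quantile_huber_loss(pred, target, tau, kappa=1.0):
       """pred: (B, N) predicted quantiles; target: (B, N') target samples;
       tau: (B, N) quantile fractions of the predictions."""
       # Pairwise TD errors: (B, N', N) — every target sample vs every prediction.
       delta = target.unsqueeze(-1) - pred.unsqueeze(1)
       huber = F.huber_loss(
           pred.unsqueeze(1).expand_as(delta),
           target.unsqueeze(-1).expand_as(delta),
           reduction="none", delta=kappa,
       )
       # Asymmetric weighting: over-/under-estimation is penalized by τ vs (1 - τ).
       weight = torch.abs(tau.unsqueeze(1) - (delta.detach() < 0).float())
       return (weight * huber / kappa).sum(dim=2).mean()
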
Why dueling (V + A)
-------------------

We decompose **Q(s, a) = V(s) + (A(s, a) − mean_a A(s, a))**. The **value**
V(s) is shared across all actions; the **advantage** A(s, a) is per action.

**Why:** In many states the value is similar for all actions (e.g. on a
straight road); learning V(s) once is more sample-efficient than learning
each Q(s, a) separately. See Dueling DQN (Wang et al., 2016). Subtracting
mean(A) keeps the decomposition unique (otherwise V and A are
underdetermined).

Why Double DQN (optional)
-------------------------

Config: ``use_ddqn: true`` (default).

In plain DQN the TD target uses the **target** network both to *choose* the
best next action and to *evaluate* it, which tends to **overestimate** Q. In
**Double DQN** we use the **online** network to *choose* the best next action
and the **target** network only to *evaluate* that action, which usually
reduces overestimation.

**In our code:** in ``train_on_batch``, when ``use_ddqn`` is True we take
``a* = argmax_a Q_online(s', a)`` and form the target as
``r + γ Q_target(s', a*)``; when it is False we use
``r + γ max_a Q_target(s', a)``.

Why mini-races (clipped horizon)
--------------------------------

When we sample a batch, **buffer_collate_function** does the following:

(1) For each transition it draws a **random horizon** (a count of actions,
    up to ``temporal_mini_race_duration_actions``, e.g. corresponding to
    7 seconds). This horizon is stored in ``state_float[:, 0]``, so the
    network sees "time left in this mini-race."
(2) **Rewards** and **gammas** are reindexed so that we only sum rewards *up
    to that horizon*; beyond the horizon we treat the transition as terminal
    (gamma = 0).
(3) **Potential-based shaping** is applied: we add (γ φ(s') − φ(s)) to the
    reward, so the value of progress is preserved without changing the
    optimal policy (Ng et al., 1999).

**Why:**

- **Credit assignment** — we only ask "how much reward in this short
  window?", which simplifies learning.
- **γ = 1 over the window** — we can use γ = 1 within the 7 s window because
  the horizon is fixed and short.
- **Same buffer, different views** — the same transition can be interpreted
  as a different "mini-race" each time it is sampled, which increases
  diversity.

See ``trackmania_rl.buffer_utilities.buffer_collate_function`` and the config
entries ``temporal_mini_race_duration_ms``, ``n_steps``, ``gamma_schedule``.

Normalization
-------------

- **Image** — In ``IQN_Network.forward()`` we compute ``(img - 128) / 128``
  (assuming input in [0, 255]), giving values approximately in [-1, 1]. This
  matches Level 0 / BC pretraining when ``image_normalization: "iqn"`` is
  set, so loading a pretrained encoder into ``img_head`` does not require
  renormalization.
- **Float state** — We apply ``(float_inputs - mean) / std`` in
  ``forward()``; ``mean`` and ``std`` come from the config
  (``state_normalization.float_inputs_mean`` and ``float_inputs_std``).

Pretrained encoder (Level 0 / BC)
---------------------------------

The **image head** (CNN) of ``IQN_Network`` has the same architecture as the
encoder saved by **Level 0** (autoencoder/SimCLR) and **Level 1 BC**
pretraining. We can **load** a pretrained ``encoder.pt`` into ``img_head``
(config: ``pretrain_encoder_path``). BC pretraining additionally trains the
same CNN to predict actions from frames; the encoder is then transferred to
IQN's ``img_head``. See :doc:`../pretrain_replay_roadmap` and
:doc:`../pretrain_bc`.

Overview: inputs and outputs
----------------------------

This network does **distributional** RL: it models the *distribution* of the
return (sum of rewards), not just its expectation. Standard DQN outputs one
Q(s, a) = E[return]; here, for each τ ∈ (0, 1) we predict the τ-quantile of
the return distribution (e.g. τ = 0.1 is a "pessimistic" scenario, τ = 0.5
the median, τ = 0.9 an "optimistic" one), trained via the quantile Huber
loss. So per state and action we get K values Q(s, a, τ₁), …, Q(s, a, τ_K)
instead of one; averaging them gives the usual Q(s, a), while the full set
captures uncertainty and often improves learning.

**Replication ("repeating" the state K times)** — After the concat we have
one state representation of shape (B, D). For each state we need Q for K
different τ, so we *repeat* that representation K times → (B×K, D), and for
each of the B×K rows we compute a τ-dependent embedding and mix it in
(Hadamard product). The result is (B×K, D): one row per (state, quantile).
The A and V heads then output Q of shape (B×K, n_actions). "Replication"
thus means: one state → K rows (one per τ), giving K quantile estimates per
state in a single forward pass.

**Dueling** — As above, Q(s, a) = V(s) + A(s, a) − mean_a A(s, a): V(s) is
the state value, A(s, a) the per-action advantage, and subtracting mean(A)
makes the decomposition unique (see *Why dueling*).

**Inputs:**

- **img** — Screen image tensor: shape ``(batch_size, 1, H, W)``, dtype
  float32/float16. Values are normalized in ``forward()`` as
  ``(img - 128) / 128`` (if given as uint8, normalization is done in
  ``Inferer``).
- **float_inputs** — Vector of scalar state features (position, zones,
  previous actions, etc.): shape ``(batch_size, float_input_dim)``.
  Normalized in ``forward()`` as ``(float_inputs - mean) / std`` from the
  config.
- **num_quantiles** — Number of quantiles (N or N′ in the IQN paper), e.g. 8
  during training.
- **tau** (optional) — Tensor of quantiles with shape
  ``(batch_size * num_quantiles, 1)``. If not provided, quantiles are sampled
  inside the network (symmetrically around 0.5).

**Outputs:**

- **Q** — Q-values for each (state, quantile): shape
  ``(batch_size * num_quantiles, n_actions)``.
- **tau** — The quantiles used: shape ``(batch_size * num_quantiles, 1)``.

The network uses a **dueling** layout: from a single shared representation it
computes the value V and the advantages A, then Q = V + A − mean(A).

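To make the output shapes concrete, here is a runnable shape sketch of how
the quantile dimension is reduced for action selection. Tensor contents are
random stand-ins, and we assume the K quantile rows of each state are
contiguous, which may differ from the actual row ordering in the code:

.. code-block:: python

   import torch

   B, K, n_actions = 16, 8, 12

   # Stand-ins for the outputs of IQN_Network.forward() described above:
   q = torch.rand(B * K, n_actions)   # Q per (state, quantile)
   tau = torch.rand(B * K, 1)         # the quantiles that were used

   # Average over the K quantiles to recover the usual Q(s, a), then act greedily.
   q_mean = q.view(B, K, n_actions).mean(dim=1)   # (B, n_actions)
   greedy_action = q_mean.argmax(dim=1)           # (B,)
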
High-level diagram (main blocks)
--------------------------------

Below is a block diagram of the main components only: what enters the network
and how data flows to the Q output.

.. graphviz::

   digraph iqn_high_level {
       rankdir=LR;
       node [shape=box, fontname="Helvetica", fontsize=11];
       edge [fontname="Helvetica", fontsize=10];

       subgraph inputs {
           node [fillcolor=lightblue, style="filled"];
           img [label="img\n(B, 1, H, W)"];
           floats [label="float_inputs\n(B, float_input_dim)"];
           tau_in [label="τ (quantiles)\noptional"];
       }
       subgraph backbone {
           node [fillcolor=lightyellow, style="filled"];
           img_head [label="Image head\n(CNN)"];
           float_head [label="Float head\n(MLP)"];
           concat [label="Concat"];
           iqn_block [label="IQN: τ-embed × concat"];
           dueling [label="Dueling\nA_head + V_head"];
       }
       subgraph outputs {
           node [fillcolor=lightgreen, style="filled"];
           Q_out [label="Q, τ\n(B×K, n_actions)"];
       }

       img -> img_head;
       floats -> float_head;
       img_head -> concat [label="conv_out"];
       float_head -> concat [label="float_hidden"];
       concat -> iqn_block;
       tau_in -> iqn_block [style=dashed];
       iqn_block -> dueling;
       dueling -> Q_out;
   }

- **Image head** — CNN over the frame; outputs one vector per sample
  (details below).
- **Float head** — Two-layer MLP over the scalar features; outputs a vector
  of size ``float_hidden_dim`` (details below).
- **Concat** — Concatenation of the two heads' outputs along the last axis;
  dimension = ``conv_head_output_dim + float_hidden_dim`` (this is
  ``dense_input_dimension`` in the code).
- **IQN block** — Quantiles τ are turned into an embedding (cos + linear
  layer), then combined with the repeated concat via an element-wise
  (Hadamard) product; output shape ``(B×K, dense_input_dimension)``
  (details below).
- **Dueling** — From this representation the advantage head A and value head
  V are computed, then Q = V + A − mean(A) (details below).

Block details
-------------

Image head (CNN)
~~~~~~~~~~~~~~~~

Four convolutions with LeakyReLU and a Flatten. Channel sizes:
1 → 16 → 32 → 64 → 32. The output is one vector per sample (its size depends
on H and W, which are set in the config via ``w_downsized`` and
``h_downsized``, e.g. 64×64).

.. graphviz::

   digraph img_head {
       rankdir=LR;
       node [shape=box, fontname="Helvetica", fontsize=10];
       in [label="(B, 1, H, W)", fillcolor=lightblue, style="filled"];
       c1 [label="Conv2d 4×4, s=2\n1→16"];
       c2 [label="Conv2d 4×4, s=2\n16→32"];
       c3 [label="Conv2d 3×3, s=2\n32→64"];
       c4 [label="Conv2d 3×3, s=1\n64→32"];
       flat [label="Flatten"];
       out [label="(B, conv_head_output_dim)", fillcolor=lightgreen, style="filled"];
       in -> c1 -> c2 -> c3 -> c4 -> flat -> out;
   }

Each Conv2d is followed by a LeakyReLU (inplace). Weights are initialized
orthogonally with the appropriate gain for LeakyReLU.

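A minimal PyTorch sketch of this head, with layer sizes taken from the
diagram above (the real class may differ in details such as padding or
initialization; the shape comments assume a 64×64 input and no padding):

.. code-block:: python

   import torch
   import torch.nn as nn

   img_head = nn.Sequential(
       nn.Conv2d(1, 16, kernel_size=4, stride=2),   # (B, 1, 64, 64) -> (B, 16, 31, 31)
       nn.LeakyReLU(inplace=True),
       nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> (B, 32, 14, 14)
       nn.LeakyReLU(inplace=True),
       nn.Conv2d(32, 64, kernel_size=3, stride=2),  # -> (B, 64, 6, 6)
       nn.LeakyReLU(inplace=True),
       nn.Conv2d(64, 32, kernel_size=3, stride=1),  # -> (B, 32, 4, 4)
       nn.LeakyReLU(inplace=True),
       nn.Flatten(),                                # -> (B, conv_head_output_dim)
   )

   # Determine the flattened size empirically, similar in spirit to
   # calculate_conv_output_dim() in iqn.py (512 for this sketch).
   conv_head_output_dim = img_head(torch.zeros(1, 1, 64, 64)).shape[1]
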
Float head (MLP)
~~~~~~~~~~~~~~~~

Two linear layers, each followed by a LeakyReLU. Input normalization
(mean/std) is applied in ``forward()`` before this head.

.. graphviz::

   digraph float_head {
       rankdir=LR;
       node [shape=box, fontname="Helvetica", fontsize=10];
       in [label="(B, float_input_dim)\nnormalized", fillcolor=lightblue, style="filled"];
       l1 [label="Linear → float_hidden_dim"];
       l2 [label="Linear → float_hidden_dim"];
       out [label="(B, float_hidden_dim)", fillcolor=lightgreen, style="filled"];
       in -> l1 -> l2 -> out;
   }

Config parameter: ``float_hidden_dim`` (e.g. 256).

IQN: quantile embedding and mixing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Quantiles τ (shape ``(B×K, 1)``) are mapped to an embedding using the IQN
formula ``cos(π · i · τ)`` for i = 1..``iqn_embedding_dimension``, followed
by one linear layer + LeakyReLU to dimension ``dense_input_dimension``. The
state vector (concat) is repeated K times (once per quantile), then
multiplied element-wise (Hadamard) by the quantile embedding. The result is
a representation of shape ``(B×K, dense_input_dimension)`` that depends on
both the state and τ.

.. graphviz::

   digraph iqn_detail {
       rankdir=TB;
       node [shape=box, fontname="Helvetica", fontsize=10];
       tau [label="τ (B×K, 1)", fillcolor=lightblue, style="filled"];
       concat_in [label="concat (B, D)", fillcolor=lightblue, style="filled"];
       cos [label="cos(π·i·τ)\n(B×K, iqn_embed_dim)"];
       fc [label="Linear + LeakyReLU\n→ (B×K, D)"];
       repeat [label="repeat K times\n(B×K, D)"];
       hadamard [label="× (element-wise)"];
       out [label="(B×K, D)", fillcolor=lightgreen, style="filled"];
       tau -> cos -> fc;
       concat_in -> repeat;
       fc -> hadamard;
       repeat -> hadamard;
       hadamard -> out;
   }

Dueling: A_head and V_head
~~~~~~~~~~~~~~~~~~~~~~~~~~

From the shared representation ``(B×K, dense_input_dimension)`` two heads are
computed:

- **A_head**: Linear(D → dense_hidden_dimension//2) → LeakyReLU → Linear →
  ``(B×K, n_actions)``.
- **V_head**: Linear(D → dense_hidden_dimension//2) → LeakyReLU → Linear →
  ``(B×K, 1)``.

Then ``Q = V + A - mean(A, dim=actions)``. This yields Q-values for all
actions and all quantiles.

.. graphviz::

   digraph dueling_detail {
       rankdir=TB;
       node [shape=box, fontname="Helvetica", fontsize=10];
       in [label="(B×K, D)\nstate × quantile", fillcolor=lightyellow, style="filled"];
       a1 [label="Linear D→512"];
       a2 [label="Linear 512→n_actions"];
       v1 [label="Linear D→512"];
       v2 [label="Linear 512→1"];
       A [label="A (B×K, n_actions)"];
       V [label="V (B×K, 1)"];
       Q [label="Q = V + A - mean(A)", fillcolor=lightgreen, style="filled"];
       in -> a1 -> a2 -> A;
       in -> v1 -> v2 -> V;
       A -> Q;
       V -> Q;
   }

Config parameter: ``dense_hidden_dimension`` (e.g. 1024); the inner layer of
each head is then 512 (= 1024 // 2). The final layers of A_head and V_head
are initialized orthogonally without extra gain.

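A minimal sketch of the τ-embedding, mixing, and dueling combination as
described in the last two subsections (PyTorch; variable names and the row
ordering of the repeat are ours, and dimensions are the example values from
above):

.. code-block:: python

   import math
   import torch
   import torch.nn as nn

   B, K, D, embed_dim, n_actions = 16, 8, 768, 128, 12

   concat = torch.rand(B, D)                    # shared state representation
   tau = torch.rand(B * K, 1)                   # one τ per (state, quantile) row

   # cos(π·i·τ) for i = 1..embed_dim, then Linear + LeakyReLU to dimension D.
   i = torch.arange(1, embed_dim + 1).float()   # (embed_dim,)
   cos_embed = torch.cos(math.pi * i * tau)     # (B*K, embed_dim) via broadcasting
   fc = nn.Sequential(nn.Linear(embed_dim, D), nn.LeakyReLU())
   tau_embed = fc(cos_embed)                    # (B*K, D)

   # Repeat each state K times and mix with the τ-embedding (Hadamard product).
   mixed = concat.repeat_interleave(K, dim=0) * tau_embed   # (B*K, D)

   # Dueling heads: Q = V + A - mean(A).
   a_head = nn.Sequential(nn.Linear(D, 512), nn.LeakyReLU(), nn.Linear(512, n_actions))
   v_head = nn.Sequential(nn.Linear(D, 512), nn.LeakyReLU(), nn.Linear(512, 1))
   A, V = a_head(mixed), v_head(mixed)
   Q = V + A - A.mean(dim=1, keepdim=True)      # (B*K, n_actions)
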
Other implementation details
----------------------------

- **Prioritized replay** — Optional (``prio_alpha > 0``): transitions are
  sampled with probability proportional to their TD error; importance
  weights are applied to the loss so that the update remains unbiased.
- **Gradient clipping** — We clip gradients both by value
  (``clip_grad_value``) and by norm (``clip_grad_norm``) to avoid
  explosions.
- **Target self-loss clamping** — We scale the per-sample loss so that the
  target's "self-loss" (target vs target) does not dominate; this stabilizes
  the quantile regression. See ``target_self_loss_clamp_ratio`` and the
  running averages in ``Trainer.train_on_batch``.
- **Exploration** — At inference we use ε-greedy or Boltzmann exploration
  over the **mean** of the K quantile outputs (config: ``exploration``
  section). We therefore still act on a single scalar Q per action, but that
  scalar is the average of the distributional output.

Config parameters
-----------------

The main dimensions are set in ``config_files/rl/config_default.yaml``
(section ``neural_network``):

- **w_downsized**, **h_downsized** — Input frame size for the CNN
  (e.g. 64×64).
- **float_hidden_dim** — Output size of the float head (256).
- **dense_hidden_dimension** — Hidden size in the A and V heads (1024).
- **iqn_embedding_dimension** — Dimension of the cos-embedding of the
  quantiles (128).
- **iqn_n** — Number of quantiles during training (8); **iqn_k** — number of
  quantiles during inference (32).

**float_input_dim** is computed at config load time (it depends on the
number of zones, previous actions, etc.). **conv_head_output_dim** is
computed from H and W via ``calculate_conv_output_dim()`` in ``iqn.py``.

See also
--------

- :doc:`iqn` — Experiments on IQN variants (DDQN, embedding size, image
  size).
- :doc:`../../main_objects` — IQN_Network, buffer, rollout_results.
- :doc:`../../first_training` — How to run training and what to expect.
- :doc:`../../configuration_guide` — All config options (neural_network,
  training, rewards, etc.).
- The ``IQN_Network`` class and its ``forward()`` method in
  ``trackmania_rl.agents.iqn``.