IQN architecture

The diagrams on this page are rendered with Graphviz (the sphinx.ext.graphviz extension). The CI workflow that publishes the docs installs Graphviz so the diagrams render on the site. For a local docs build, install Graphviz and ensure dot is on your PATH, or you will see raw DOT code instead of images.

This page describes the structure of IQN_Network (trackmania_rl.agents.iqn), how training works (data flow, replay, loss), and why the main design choices were made.

What we use

  • IQN — value-based RL with a distributional head: we predict quantiles of the return (sum of future rewards), not just its expectation. For each (state, action) we get K values (one per quantile τ ∈ (0,1)); their mean is the usual Q(s,a). This often improves sample efficiency and stability (see Why distributional below).

  • Discrete actions — e.g. 12 classes (steer × accel × brake binned), defined in config_files/inputs_list.py and config.inputs.

  • Inputs — (1) image: one grayscale frame per step, downscaled (e.g. 64×64); (2) float state: scalar features (position along track, zone indices, previous actions, etc.), see state_normalization and float_input_dim in config.
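The discrete action space can be pictured as a Cartesian product of bins. A minimal sketch (the real list lives in config_files/inputs_list.py; the exact bin values here are assumptions):

```python
from itertools import product

# Assumed bins -- the actual action list is defined in config_files/inputs_list.py.
STEER = [-1.0, 0.0, 1.0]   # left / straight / right
ACCEL = [False, True]      # throttle off / on
BRAKE = [False, True]      # brake off / on

# One discrete class per (steer, accel, brake) combination.
ACTIONS = [
    {"steer": s, "accelerate": a, "brake": b}
    for s, a, b in product(STEER, ACCEL, BRAKE)
]
print(len(ACTIONS))  # 3 * 2 * 2 = 12 classes
```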

Training loop (data flow)

  1. Collectors — Several game instances run in parallel. Each has an inference_network (copy of the policy). The agent observes (frame, float_state), chooses an action (ε-greedy or Boltzmann over mean Q), and sends it to the game. Transitions (state, action, reward, next_state, …) are sent to the learner via a queue.

  2. Learner — One process. It holds the online_network (updated by gradients) and the target_network (periodically synced with the online network, e.g. a soft Polyak update with coefficient 0.02, or a hard copy every N steps). It also maintains an uncompiled_shared_network: the learner copies online → shared; collectors copy shared → their inference_network. So collectors always use a slightly stale but consistent policy.

  3. Replay buffer — Transitions are stored in a ReplayBuffer (e.g. prioritized). Sampling is done in mini-batches; each batch is then passed through buffer_collate_function, which implements the mini-race logic (see Why mini-races below).

  4. train_on_batch — For each batch we compute TD targets using the target network and current rewards/gammas; we compute Q(s,a) for the sampled (s,a) using the online network; we minimize the quantile Huber loss between targets and outputs, then backprop and update the online network.

Why separate online and target? Standard in DQN: the target is held fixed for many steps so the learning signal is stable; otherwise we would be chasing a moving target (bootstrapping from a network we keep changing).
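Both sync styles can be sketched as follows (plain dicts of arrays stand in for network parameters; the 0.02 coefficient and the sync period are illustrative, the real values come from config):

```python
import numpy as np

def soft_update(online: dict, target: dict, coeff: float = 0.02) -> None:
    """Polyak averaging: target <- coeff * online + (1 - coeff) * target."""
    for name, w in online.items():
        target[name] = coeff * w + (1.0 - coeff) * target[name]

def hard_update(online: dict, target: dict, step: int, every: int = 1000) -> None:
    """Full copy of online weights into the target every `every` learner steps."""
    if step % every == 0:
        for name, w in online.items():
            target[name] = w.copy()

online = {"w": np.ones(4)}
target = {"w": np.zeros(4)}
soft_update(online, target)   # target["w"] is now 0.02 everywhere
```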

Why distributional (quantiles)

In standard DQN we learn one number Q(s,a) = E[return]. In IQN we learn the distribution of the return via its quantiles: for τ ∈ (0,1), we predict the τ-quantile (e.g. τ=0.1 = pessimistic, τ=0.5 = median). The network is trained with quantile Huber loss so that predicted quantiles match the distribution of TD targets.
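A minimal numpy sketch of the quantile Huber loss for one (state, action) pair (the training code uses PyTorch and batches; reduction choices here are simplified):

```python
import numpy as np

def quantile_huber_loss(pred, tau, target, kappa=1.0):
    """Pairwise quantile Huber loss as in IQN (Dabney et al., 2018).

    pred:   (N,)  predicted quantile values at fractions tau_i
    tau:    (N,)  quantile fractions in (0, 1)
    target: (M,)  samples of the TD target distribution
    """
    delta = target[None, :] - pred[:, None]              # (N, M) pairwise TD errors
    abs_d = np.abs(delta)
    huber = np.where(abs_d <= kappa, 0.5 * delta ** 2,
                     kappa * (abs_d - 0.5 * kappa))      # elementwise Huber
    weight = np.abs(tau[:, None] - (delta < 0))          # asymmetric quantile weight
    return (weight * huber / kappa).mean(axis=1).sum()   # mean over targets, sum over quantiles

# A perfect median prediction against symmetric targets:
quantile_huber_loss(np.array([0.0]), np.array([0.5]), np.array([-1.0, 1.0]))  # 0.25
```

The asymmetric weight is what makes each output track its own quantile: under-predictions are penalized more when τ is large, over-predictions more when τ is small.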

Why this helps: (1) Richer signal — the full distribution captures risk and uncertainty. (2) Better gradient flow — multiple quantiles provide more learning signal per transition than a single scalar. (3) Stability — distributional methods (IQN, QR-DQN, C51) often reduce overestimation and improve convergence.

We use implicit quantiles: τ is sampled (or fixed) per forward pass and embedded via cos(π·i·τ); the state representation is repeated K times and mixed with this embedding. Config: iqn_n (e.g. 8) quantiles during training; iqn_k (e.g. 32) during inference for action selection (we average over quantiles then choose argmax).

Why dueling (V + A)

We decompose Q(s,a) = V(s) + (A(s,a) − mean_a A(s,a)). The value V(s) is shared across all actions; the advantage A(s,a) is per action.

Why: In many states the value is similar for all actions (e.g. straight road); learning V(s) once is more sample-efficient than learning each Q(s,a) separately. See Dueling DQN (Wang et al., 2016). The subtraction of mean(A) keeps the decomposition unique (otherwise V and A are underdetermined).

Why Double DQN (optional)

Config: use_ddqn: true (default). In plain DQN the TD target uses the target network both to choose the best next action and to evaluate it → tends to overestimate Q. In Double DQN we use the online network to choose the best next action and the target network only to evaluate that action → usually reduces overestimation.

In our code: In train_on_batch, when use_ddqn is True we take a* = argmax_a Q_online(s', a) then form the target as r + γ Q_target(s', a*); when False we use r + γ max_a Q_target(s', a).
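The two target variants can be sketched on mean-Q arrays (a numpy sketch, not the repo's train_on_batch):

```python
import numpy as np

def td_target(r, gamma, done, q_next_online, q_next_target, use_ddqn=True):
    """TD targets for a batch; q_next_* have shape (B, n_actions)."""
    if use_ddqn:
        a_star = q_next_online.argmax(axis=1)                    # online net chooses a*
        q_next = q_next_target[np.arange(len(a_star)), a_star]   # target net evaluates a*
    else:
        q_next = q_next_target.max(axis=1)                       # target net does both
    return r + gamma * (1.0 - done) * q_next

r, gamma, done = np.array([1.0]), 0.9, np.array([0.0])
q_on = np.array([[1.0, 0.0]])     # online prefers action 0
q_tg = np.array([[0.2, 0.9]])     # target overestimates action 1
td_target(r, gamma, done, q_on, q_tg, use_ddqn=True)    # 1 + 0.9*0.2 = 1.18
td_target(r, gamma, done, q_on, q_tg, use_ddqn=False)   # 1 + 0.9*0.9 = 1.81
```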

Why mini-races (clipped horizon)

When we sample a batch, buffer_collate_function does the following: (1) For each transition it draws a random horizon (in number of actions) up to temporal_mini_race_duration_actions (the number of actions spanning e.g. 7 seconds of play). This horizon is stored in state_float[:, 0] so the network sees “time left in this mini-race.” (2) Rewards and gammas are reindexed so that we only sum rewards up to that horizon; beyond the horizon we treat the transition as terminal (gamma=0). (3) Potential-based shaping is applied: we add γ·φ(s') − φ(s) to the reward, which preserves the value of progress without changing the optimal policy (Ng et al.).

Why: Credit assignment — we only ask “how much reward in this short window?”, which simplifies learning. Gamma = 1 over the window — we can use γ=1 within the 7s window because the horizon is fixed and short. Same buffer, different views — the same transition can be interpreted as different “mini-races” on different samples, which increases diversity. See trackmania_rl.buffer_utilities.buffer_collate_function and config temporal_mini_race_duration_ms, n_steps, gamma_schedule.
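An illustrative sketch of the two mechanics above, horizon clipping and potential-based shaping; the actual logic lives in buffer_collate_function and differs in detail:

```python
import numpy as np

def mini_race_view(rewards, potentials, horizon, gamma=1.0):
    """Clip a stored n-step transition to a sampled mini-race horizon.

    rewards:    (T,)   per-action rewards along the transition
    potentials: (T+1,) shaping potential phi at each visited state
    horizon:    actions left in the mini-race; if it ends inside this
                transition, the transition is treated as terminal (gamma -> 0)
    """
    T = len(rewards)
    n = min(horizon, T)
    # Potential-based shaping (Ng et al.): r'_t = r_t + gamma*phi(s_{t+1}) - phi(s_t)
    shaped = rewards[:n] + gamma * potentials[1:n + 1] - potentials[:n]
    n_step_return = float((gamma ** np.arange(n) * shaped).sum())
    bootstrap_gamma = gamma ** n if horizon > T else 0.0
    return n_step_return, bootstrap_gamma

mini_race_view(np.array([1.0, 1.0, 1.0]), np.zeros(4), horizon=2)  # (2.0, 0.0): terminal
mini_race_view(np.array([1.0, 1.0, 1.0]), np.zeros(4), horizon=5)  # (3.0, 1.0): bootstrap
```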

Normalization

  • Image — In IQN_Network.forward() we do (img - 128) / 128 (assuming input in [0, 255]) → approximately [-1, 1]. This matches Level 0 / BC pretraining when image_normalization: "iqn" is set, so that loading a pretrained encoder into img_head does not require renormalization.

  • Float state — We apply (float_inputs - mean) / std in forward(); mean and std come from config (state_normalization.float_inputs_mean and float_inputs_std).
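Both normalizations in one sketch (the mean/std values are placeholders for the config's state_normalization entries):

```python
import numpy as np

def preprocess(img_u8, float_inputs, mean, std):
    """Normalization as described above; inputs are a uint8 frame and raw floats."""
    img = (img_u8.astype(np.float32) - 128.0) / 128.0   # [0, 255] -> roughly [-1, 1]
    floats = (float_inputs - mean) / std                # per-feature standardization
    return img, floats

img, floats = preprocess(np.array([[0, 128, 255]], dtype=np.uint8),
                         np.array([10.0, 4.0]),
                         mean=np.array([8.0, 0.0]), std=np.array([2.0, 4.0]))
# img -> [-1.0, 0.0, ~0.992]; floats -> [1.0, 1.0]
```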

Pretrained encoder (Level 0 / BC)

The image head (CNN) of IQN_Network has the same architecture as the encoder saved by Level 0 (autoencoder/SimCLR) and Level 1 BC pretraining. We can load a pretrained encoder.pt into img_head (config: pretrain_encoder_path). BC pretrain additionally trains the same CNN to predict actions from frames; the encoder is then transferred to IQN’s img_head. See Replay pretrain roadmap and BC pretraining.

Overview: inputs and outputs

This network does distributional RL: it models the distribution of the return (sum of rewards), not just its expectation. Standard DQN outputs a single Q(s,a) = E[return]; here, for each τ ∈ (0,1) we predict the τ-quantile of the return distribution (e.g. τ=0.1 = “pessimistic” scenario, τ=0.5 = median, τ=0.9 = “optimistic”), trained via the quantile Huber loss. So per (state, action) we get K values Q(s,a,τ₁), …, Q(s,a,τₖ) instead of one; averaging them gives the usual Q(s,a), but the full set captures uncertainty and often improves learning (IQN, QR-DQN, and C51 are all distributional methods).

Replication (“repeating” state K times) — After concat we have one state representation of shape (B, D). For each state we need Q at K different τ, so we repeat that representation K times → (B×K, D), compute a τ-dependent embedding for each of the B×K rows, and mix it with the repeated state via an element-wise (Hadamard) product. The result is still (B×K, D): one row per (state, quantile). The A and V heads then output Q of shape (B×K, n_actions). In short, replication turns one state into K rows (one per τ), so that K quantile estimates come out of a single forward pass.

Dueling — We decompose Q(s,a) = V(s) + A(s,a), where V(s) is the state value and A(s,a) is the advantage of action a (we use Q = V + A - mean(A) so the decomposition is unique). In many states the value is similar across actions; learning V(s) once and small advantages per action is more sample-efficient than learning each Q(s,a) from scratch. See Dueling DQN (Wang et al., 2016).

Inputs:

  • img — Screen image tensor: shape (batch_size, 1, H, W), dtype float32/float16. Values are normalized in forward() as (img - 128) / 128 (if given as uint8, normalization is done in Inferer).

  • float_inputs — Vector of scalar state features (position, zones, previous actions, etc.): shape (batch_size, float_input_dim). Normalized in forward() as (float_inputs - mean) / std from config.

  • num_quantiles — Number of quantiles (N or N’ in the IQN paper), e.g. 8 during training.

  • tau (optional) — Tensor of quantiles with shape (batch_size * num_quantiles, 1). If not provided, quantiles are sampled inside the network (symmetrically around 0.5).
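A sketch of the symmetric sampling (an assumption about the internal implementation: pairing u with 1 − u centers the K fractions around 0.5; requires an even num_quantiles):

```python
import numpy as np

def sample_taus(batch_size, num_quantiles, rng=None):
    """Sample quantile fractions in (u, 1-u) pairs, symmetric around 0.5."""
    if rng is None:
        rng = np.random.default_rng()
    half = rng.uniform(size=(batch_size, num_quantiles // 2))
    tau = np.concatenate([half, 1.0 - half], axis=1)    # (B, K), pairwise symmetric
    return tau.reshape(batch_size * num_quantiles, 1)   # (B*K, 1), as forward() expects

tau = sample_taus(batch_size=4, num_quantiles=8)   # shape (32, 1), mean exactly 0.5
```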

Outputs:

  • Q — Q-values for each (state, quantile): shape (batch_size * num_quantiles, n_actions).

  • tau — The quantiles used: shape (batch_size * num_quantiles, 1).

The network uses a dueling layout: from a single shared representation it computes value V and advantages A, then Q = V + A - mean(A).

High-level diagram (main blocks)

Below is a block diagram of the main components only: what enters the network and how data flows to the Q output.

digraph iqn_high_level {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=11];
   edge [fontname="Helvetica", fontsize=10];

   subgraph inputs {
     node [fillcolor=lightblue, style="filled"];
     img [label="img\n(B, 1, H, W)"];
     floats [label="float_inputs\n(B, float_input_dim)"];
     tau_in [label="τ (quantiles)\noptional"];
   }

   subgraph backbone {
     node [fillcolor=lightyellow, style="filled"];
     img_head [label="Image head\n(CNN)"];
     float_head [label="Float head\n(MLP)"];
     concat [label="Concat"];
     iqn_block [label="IQN: τ-embed × concat"];
     dueling [label="Dueling\nA_head + V_head"];
   }

   subgraph outputs {
     node [fillcolor=lightgreen, style="filled"];
     Q_out [label="Q, τ\n(B×K, n_actions)"];
   }

   img -> img_head;
   floats -> float_head;
   img_head -> concat [label="conv_out"];
   float_head -> concat [label="float_hidden"];
   concat -> iqn_block;
   tau_in -> iqn_block [style=dashed];
   iqn_block -> dueling;
   dueling -> Q_out;
}
  • Image head — CNN over the frame; outputs one vector per sample (details below).

  • Float head — Two-layer MLP over scalar features; output size matches the float branch dimension (details below).

  • Concat — Concatenation of the two heads’ outputs along the last axis; dimension = conv_head_output_dim + float_hidden_dim (this is dense_input_dimension in the code).

  • IQN block — Quantiles τ are turned into an embedding (cos + linear layer), then element-wise (Hadamard) product with the repeated concat; output shape (B×K, dense_input_dimension) (details below).

  • Dueling — From this representation the advantage head A and value head V are computed, then Q = V + A - mean(A) (details below).

Block details

Image head (CNN)

Four convolutions with LeakyReLU and Flatten. Channel sizes: 1 → 16 → 32 → 64 → 32. Output is one vector per sample (size depends on H, W; set in config via w_downsized, h_downsized, e.g. 64×64).

digraph img_head {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   in [label="(B, 1, H, W)", fillcolor=lightblue, style="filled"];
   c1 [label="Conv2d 4×4, s=2\n1→16"];
   c2 [label="Conv2d 4×4, s=2\n16→32"];
   c3 [label="Conv2d 3×3, s=2\n32→64"];
   c4 [label="Conv2d 3×3, s=1\n64→32"];
   flat [label="Flatten"];
   out [label="(B, conv_head_output_dim)", fillcolor=lightgreen, style="filled"];
   in -> c1 -> c2 -> c3 -> c4 -> flat -> out;
}

Each Conv2d is followed by LeakyReLU (inplace). Weights are initialized orthogonally with the appropriate gain for LeakyReLU.
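The flattened output size follows from the standard Conv2d size formula. A sketch of what calculate_conv_output_dim() computes (padding=0 is an assumption; kernel/stride values are taken from the diagram above):

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a Conv2d: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_head_output_dim(h, w):
    """Flattened size after the four convolutions above (32 output channels)."""
    for k, s in [(4, 2), (4, 2), (3, 2), (3, 1)]:
        h, w = conv_out(h, k, s), conv_out(w, k, s)
    return 32 * h * w

conv_head_output_dim(64, 64)   # 512 under these assumptions (64 -> 31 -> 14 -> 6 -> 4)
```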

Float head (MLP)

Two linear layers with LeakyReLU. Input normalization (mean/std) is applied in forward() before this head.

digraph float_head {
   rankdir=LR;
   node [shape=box, fontname="Helvetica", fontsize=10];
   in [label="(B, float_input_dim)\nnormalized", fillcolor=lightblue, style="filled"];
   l1 [label="Linear → float_hidden_dim"];
   l2 [label="Linear → float_hidden_dim"];
   out [label="(B, float_hidden_dim)", fillcolor=lightgreen, style="filled"];
   in -> l1 -> l2 -> out;
}

Config parameter: float_hidden_dim (e.g. 256). LeakyReLU after each Linear.

IQN: quantile embedding and mixing

Quantiles τ (shape (B×K, 1)) are mapped to an embedding using the IQN formula: cos(π · i · τ) for i = 1..iqn_embedding_dimension, then one linear layer + LeakyReLU to dimension dense_input_dimension. The state vector (concat) is repeated K times (one per quantile), then multiplied by the quantile embedding (Hadamard). The result is a representation of shape (B×K, dense_input_dimension) that depends on both state and τ.

digraph iqn_detail {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   tau [label="τ (B×K, 1)", fillcolor=lightblue, style="filled"];
   concat_in [label="concat (B, D)", fillcolor=lightblue, style="filled"];
   cos [label="cos(π·i·τ)\n(B×K, iqn_embed_dim)"];
   fc [label="Linear + LeakyReLU\n→ (B×K, D)"];
   repeat [label="repeat K times\n(B×K, D)"];
   hadamard [label="× (element-wise)"];
   out [label="(B×K, D)", fillcolor=lightgreen, style="filled"];
   tau -> cos -> fc;
   concat_in -> repeat;
   fc -> hadamard;
   repeat -> hadamard;
   hadamard -> out;
}
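A shape-level numpy sketch of this block (a random matrix stands in for the trained Linear layer; the dimensions are illustrative, not the config defaults):

```python
import numpy as np

B, K, D, EMB = 2, 8, 6, 64   # batch, quantiles, dense_input_dimension, iqn_embedding_dimension
rng = np.random.default_rng(0)

tau = rng.uniform(size=(B * K, 1))              # (B*K, 1) quantile fractions
i = np.arange(1, EMB + 1)                       # i = 1..iqn_embedding_dimension
cos_emb = np.cos(np.pi * i * tau)               # (B*K, EMB) cosine features

z = cos_emb @ rng.normal(size=(EMB, D))         # stand-in for the Linear layer
phi = np.where(z > 0, z, 0.01 * z)              # LeakyReLU

concat = rng.normal(size=(B, D))                # state representation after Concat
state = np.repeat(concat, K, axis=0)            # (B*K, D): K consecutive rows per state
mixed = state * phi                             # Hadamard product, (B*K, D)
```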

Dueling: A_head and V_head

From the shared representation (B×K, dense_input_dimension) two heads are computed:

  • A_head: Linear(D → dense_hidden_dimension//2) → LeakyReLU → Linear → (B×K, n_actions).

  • V_head: Linear(D → dense_hidden_dimension//2) → LeakyReLU → Linear → (B×K, 1).

Then: Q = V + A - mean(A, dim=actions). This yields Q-values for all actions and all quantiles.

digraph dueling_detail {
   rankdir=TB;
   node [shape=box, fontname="Helvetica", fontsize=10];
   in [label="(B×K, D)\nstate × quantile", fillcolor=lightyellow, style="filled"];
   a1 [label="Linear D→512"];
   a2 [label="Linear 512→n_actions"];
   v1 [label="Linear D→512"];
   v2 [label="Linear 512→1"];
   A [label="A (B×K, n_actions)"];
   V [label="V (B×K, 1)"];
   Q [label="Q = V + A - mean(A)", fillcolor=lightgreen, style="filled"];
   in -> a1 -> a2 -> A;
   in -> v1 -> v2 -> V;
   A -> Q;
   V -> Q;
}

Config parameter dense_hidden_dimension (e.g. 1024); then the inner layer of the heads is 512. The final layers of A_head and V_head are initialized orthogonally without extra gain.
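The combination step Q = V + A - mean(A) can be sketched in a few lines (numpy; the real heads are the two-layer MLPs above):

```python
import numpy as np

def dueling_q(V, A):
    """Q = V + A - mean_a A over the (B*K, n_actions) layout described above."""
    return V + A - A.mean(axis=1, keepdims=True)

V = np.array([[10.0], [5.0]])             # (B*K, 1) state values
A = np.array([[1.0, 3.0], [0.0, 0.0]])    # (B*K, n_actions) advantages
dueling_q(V, A)   # [[ 9., 11.], [ 5., 5.]]
```

Subtracting mean(A) forces the advantages to be zero-mean, so V alone carries the average level of Q; without it, any constant could be shifted between V and A.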

Other implementation details

  • Prioritized replay — Optional (prio_alpha > 0): transitions are sampled with probability proportional to TD error; importance weights are applied to the loss so that the update remains unbiased.

  • Gradient clipping — We clip gradients by value (clip_grad_value) and by norm (clip_grad_norm) to avoid explosions.

  • Target self-loss clamping — We scale the per-sample loss so that the target’s “self-loss” (target vs target) does not dominate; this stabilizes quantile regression. See target_self_loss_clamp_ratio and the running averages in Trainer.train_on_batch.

  • Exploration — At inference we use ε-greedy or Boltzmann over the mean of the K quantile outputs (config: exploration section). So we still act on a single scalar Q per action, but that scalar is the average of the distributional output.
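Action selection over the distributional output reduces to a mean followed by an argmax; a sketch of the ε-greedy branch (the Boltzmann variant and exact config handling are omitted):

```python
import numpy as np

def select_action(q_quantiles, epsilon, rng=None):
    """Epsilon-greedy over the mean of the K quantile outputs for one state.

    q_quantiles: (K, n_actions) -- one row per quantile tau.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_actions = q_quantiles.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore uniformly
    q_mean = q_quantiles.mean(axis=0)         # collapse to one scalar Q per action
    return int(q_mean.argmax())               # greedy on mean Q

q = np.array([[0.0, 1.0], [0.0, 3.0]])        # K=2 quantiles, 2 actions
select_action(q, epsilon=0.0)                 # -> 1 (mean Q = [0, 2])
```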

Config parameters

Main dimensions are set in config_files/rl/config_default.yaml (section neural_network):

  • w_downsized, h_downsized — Input frame size for the CNN (e.g. 64×64).

  • float_hidden_dim — Output size of the float head (256).

  • dense_hidden_dimension — Hidden size in the A and V heads (1024).

  • iqn_embedding_dimension — Dimension of the cos-embedding of quantiles (128).

  • iqn_n — Number of quantiles during training (8); iqn_k — during inference (32).

float_input_dim is computed at config load time (depends on number of zones, previous actions, etc.). conv_head_output_dim is computed from H, W via calculate_conv_output_dim() in iqn.py.

See also

  • IQN model experiments — Experiments on IQN variants (DDQN, embedding size, image size).

  • Main Objects — IQN_Network, buffer, rollout_results.

  • Getting started — How to run training and what to expect.

  • Configuration Guide — All config options (neural_network, training, rewards, etc.).

  • The IQN_Network class and forward() method in trackmania_rl.agents.iqn.