TensorBoard Metrics Reference

This document provides a comprehensive guide to all metrics logged to TensorBoard during training. Metrics are organized into groups for easier navigation.

Overview

All metrics are logged with prefixes that group them into categories:
- Training/ - Training process metrics
- RL/ - Reinforcement learning metrics
- Race/ - Race performance metrics
- Gradients/ - Gradient monitoring
- Performance/ - System performance metrics
- Buffer/ - Replay buffer statistics
- Network/ - Neural network weights and optimizer state
- IQN/ - IQN-specific metrics
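As a small illustration of the prefix convention, the sketch below groups a list of metric tags by the text before the first slash. The helper name `group_by_prefix` and the `Other` key for tags without a slash are hypothetical choices made for this example, not part of the training code.

```python
from collections import defaultdict


def group_by_prefix(tags):
    """Group metric tags by the text before the first '/'."""
    groups = defaultdict(list)
    for tag in tags:
        # Tags without a '/' have no group prefix; file them under "Other".
        prefix = tag.split("/", 1)[0] if "/" in tag else "Other"
        groups[prefix].append(tag)
    return dict(groups)


tags = [
    "Training/loss",
    "RL/avg_Q",
    "Training/learning_rate",
    "cumul_number_frames_played",
]
grouped = group_by_prefix(tags)
```

The same grouping is what the TensorBoard search box gives you when you type a prefix such as "Training/".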

Training Metrics

Training/loss

Description: Training loss computed on batches from the replay buffer.

Interpretation:
- In reinforcement learning, loss increasing early in training is normal and expected
- This indicates the agent is discovering the environment and identifying inconsistencies in its value estimates
- Loss should stabilize or decrease after ~1-2M frames
- Values typically range from 0.01 to 10.0

What to watch for:
- Sudden spikes (>100) may indicate gradient explosions
- Consistently increasing loss after 5M+ frames may indicate learning issues

Training/loss_test

Description: Test loss computed on held-out test buffer (not used for training).

Interpretation:
- Should track training loss but be slightly higher
- Large gap between training and test loss indicates overfitting
- Useful for detecting when the model is memorizing rather than generalizing

What to watch for:
- Test loss much higher than training loss (>2x) suggests overfitting
- Test loss decreasing while training loss increases suggests good generalization
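The loss checks described for Training/loss and Training/loss_test can be expressed as a small helper. This is a minimal sketch: the function name `loss_warnings` is hypothetical, and the default thresholds (100 for spikes, 2x for the test/train ratio) are the rules of thumb from this document, not values read from the training code.

```python
def loss_warnings(train_loss, test_loss, spike_threshold=100.0, overfit_ratio=2.0):
    """Return a list of warning strings for one pair of loss readings."""
    warnings = []
    if train_loss > spike_threshold:
        warnings.append("training loss spike: possible gradient explosion")
    # Guard against division by zero when the training loss is exactly zero.
    if train_loss > 0 and test_loss / train_loss > overfit_ratio:
        warnings.append("test loss far above training loss: possible overfitting")
    return warnings


healthy = loss_warnings(train_loss=1.0, test_loss=1.2)
spiking = loss_warnings(train_loss=150.0, test_loss=160.0)
overfit = loss_warnings(train_loss=1.0, test_loss=2.5)
```

In practice you would apply this to smoothed values, since single noisy readings can cross either threshold briefly.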

Training/learning_rate

Description: Current learning rate used by the optimizer.

Interpretation:
- Decays according to the learning rate schedule
- Typical range: 1e-5 to 1e-3
- Lower learning rates in later training allow fine-tuning

What to watch for:
- Should decrease smoothly over time
- Abrupt changes indicate schedule issues

Training/weight_decay

Description: L2 regularization strength (weight decay coefficient).

Interpretation:
- Prevents overfitting by penalizing large weights
- Typically proportional to learning rate
- Range: 1e-7 to 1e-5

What to watch for:
- Should track learning rate if using proportional weight decay
- Too high values can prevent learning

Training/batch_size

Description: Number of transitions sampled per training batch.

Interpretation:
- Larger batches provide more stable gradients but slower updates
- Typical values: 32, 64, 128, 256

What to watch for:
- Should remain constant unless explicitly changed in config

Training/n_steps

Description: N-step return horizon for bootstrapping.

Interpretation:
- Number of steps used in n-step returns
- Higher values reduce bias but increase variance
- Typical range: 1-5

What to watch for:
- Should remain constant unless explicitly changed in config
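For reference, an n-step return sums the first n discounted rewards and then bootstraps from a value estimate at step n. The sketch below uses the standard textbook form; the function name and the example numbers are illustrative and not taken from the training code.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of the first n rewards plus a discounted bootstrap value.

    `rewards` holds r_0 .. r_{n-1}; `bootstrap_value` is the value estimate
    for the state reached after those n steps.
    """
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


# Three rewards of 1.0, gamma = 0.5, bootstrap value 10.0:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
g = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With a longer horizon, more of the target comes from observed rewards and less from the bootstrap estimate, which is the bias/variance trade-off noted above.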

Training/discard_non_greedy_actions_in_nsteps

Description: Whether non-greedy (exploratory) actions are excluded from n-step returns.

Interpretation:
- 1.0 = True (only greedy actions in n-step backup)
- 0.0 = False (all actions included)
- Recommended: True to reduce exploration bias

What to watch for:
- Should remain constant unless explicitly changed in config

Training/train_on_batch_duration

Description: Median time (in seconds) to process one training batch.

Interpretation:
- Lower is better (faster training)
- Typical range: 0.01-0.1 seconds
- Affected by GPU speed, batch size, and network complexity

What to watch for:
- Sudden increases may indicate GPU throttling or system issues
- Should be relatively stable

RL Metrics

RL/avg_Q

Description: Average Q-value (expected future reward) predicted by the network.

Interpretation:
- Key indicator of learning progress
- Starts near zero for untrained agent
- Initially decreases as agent discovers it plays poorly
- Should increase as agent learns better strategies
- Higher values indicate agent expects more reward

What to watch for:
- Should trend upward after initial exploration phase (~500K-1M frames)
- Plateaus indicate agent has learned current strategy
- Decreasing values may indicate learning instability

RL/single_zone_reached

Description: Furthest virtual checkpoint (zone) reached during a race, expressed as a fraction of the track (0.0 to 1.0).

Interpretation:
- 0.0 = agent didn’t start
- 1.0 = agent finished the track
- Shows how far agent progresses along the track

What to watch for:
- Should increase over time
- Takes ~300K steps to learn to press forward
- Takes ~500K steps to finish map for first time
- Takes ~1M steps to regularly finish map
- Plateaus indicate agent is stuck at certain sections

RL/gamma

Description: Discount factor for future rewards.

Interpretation:
- Controls how much future rewards are valued
- Range: 0.0 (only immediate reward) to 1.0 (all future rewards equally)
- Typically increases from 0.999 to 1.0 during training
- Higher values make agent plan further ahead

What to watch for:
- Should increase according to schedule
- Too low values make agent short-sighted
- Too high values (1.0) can cause instability
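A common rule of thumb relates gamma to an effective planning horizon of roughly 1 / (1 - gamma) steps, which shows why small changes near 1.0 matter so much. The helper below is an illustrative sketch with a hypothetical name, not part of the training code.

```python
import math


def effective_horizon(gamma):
    """Approximate number of future steps that meaningfully affect the return."""
    if gamma >= 1.0:
        # With no discounting, every future reward counts equally.
        return math.inf
    return 1.0 / (1.0 - gamma)


horizon_099 = effective_horizon(0.99)    # roughly 100 steps
horizon_0999 = effective_horizon(0.999)  # roughly 1000 steps
```

Moving gamma from 0.99 to 0.999 therefore lengthens the horizon about tenfold, and at exactly 1.0 the horizon is unbounded.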

RL/epsilon

Description: Epsilon-greedy exploration rate.

Interpretation:
- Probability of taking random action instead of greedy action
- Decays from 1.0 (fully random) to ~0.03 (mostly greedy)
- Higher values = more exploration
- Lower values = more exploitation

What to watch for:
- Should decay smoothly according to schedule
- Too fast decay = insufficient exploration
- Too slow decay = agent doesn’t exploit learned strategies
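Epsilon-greedy selection and a decaying schedule can be sketched as follows. The linear shape and the `decay_steps` value are assumptions made for illustration; this document only states that epsilon decays from 1.0 to about 0.03, not how the schedule is defined. Both function names are hypothetical.

```python
import random


def linear_epsilon(step, start=1.0, end=0.03, decay_steps=1_000_000):
    """Linearly interpolate epsilon from `start` to `end` over `decay_steps`."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0)
```

With `epsilon=0.0` the call always returns the index of the largest Q-value, which corresponds to the greedy policy used for evaluation runs.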

RL/epsilon_boltzmann

Description: Boltzmann exploration temperature parameter.

Interpretation:
- Controls softmax temperature for action selection
- Higher values = more uniform action distribution (more exploration)
- Lower values = more peaked distribution (more exploitation)
- Used in combination with epsilon-greedy

What to watch for:
- Should decay according to schedule
- Works together with epsilon for exploration strategy
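The effect of a softmax temperature on action selection can be illustrated with a generic Boltzmann sampler. This is a textbook sketch under the assumption that Q-values are divided by a positive temperature before the softmax; how the training code actually combines this with epsilon-greedy is not described here, and the function names are hypothetical.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Softmax over Q-values divided by a positive temperature."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_boltzmann(q_values, temperature, rng=random):
    """Sample one action index from the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


hot = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=100.0)   # near uniform
cold = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=0.01)   # near greedy
```

A high temperature flattens the distribution toward uniform, while a low temperature concentrates almost all probability on the best action.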

RL/tau_epsilon_boltzmann

Description: Tau parameter for Boltzmann exploration.

Interpretation:
- Additional temperature parameter for IQN quantile sampling
- Affects exploration in distributional RL setting
- Typically constant value

What to watch for:
- Should remain constant unless explicitly changed

RL/mean_action_gap

Description: Average difference between best Q-value and other Q-values per state.

Interpretation:
- Measures how confident the agent is in its action selection
- Higher values = agent has clear preference for one action
- Lower values = agent is uncertain between actions
- Negative values are possible (computed as negative gap)

What to watch for:
- Should increase as agent learns (becomes more confident)
- Very low values indicate high uncertainty

RL/q_value_{i}_starting_frame

Description: Q-value for action {i} at the starting frame of a race.

Interpretation:
- Shows agent’s expected reward for each action at race start
- Useful for understanding initial action preferences
- Typically logged for action 0 (forward)

What to watch for:
- Should increase as agent learns
- Can reveal if agent has learned good starting strategy

Race Metrics

Race/eval_race_time_robust

Description: Most important performance metric! Best evaluation race times (greedy policy, no exploration).

Interpretation:
- Time in seconds for evaluation runs that finished within 2% of rolling mean
- Only includes “robust” runs (consistent performance)
- Lower is better
- This is the primary metric to track for agent performance

What to watch for:
- Should decrease over time (agent getting faster)
- Plateaus indicate agent has learned current strategy
- Compare with reference times (author/gold) if available
- Most reliable indicator of actual performance
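The "within 2% of rolling mean" rule can be sketched as a filter over a sequence of evaluation times. The window size, the choice to compare each time against the mean of the preceding runs only, and the function name `robust_times` are all assumptions for illustration; this document does not specify how the rolling mean is computed.

```python
from collections import deque


def robust_times(times, window=5, tolerance=0.02):
    """Keep race times within `tolerance` of the mean of the previous `window` times."""
    recent = deque(maxlen=window)
    kept = []
    for t in times:
        if recent:
            mean = sum(recent) / len(recent)
            if abs(t - mean) <= tolerance * mean:
                kept.append(t)
        # Every run updates the rolling window, robust or not.
        recent.append(t)
    return kept


consistent = robust_times([60.0, 60.3, 60.1])
with_outlier = robust_times([60.0, 60.5, 75.0])
```

Note that in this sketch the first run is never kept because there is no earlier mean to compare against, and an outlier shifts the window mean for the runs that follow it.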

Race/eval_race_time_{status}_{map}

Description: Evaluation race time for specific map and status.

Interpretation:
- Time in seconds for evaluation runs
- Includes all evaluation runs (not just robust ones)
- More variable than robust times
- Status indicates run quality (e.g., “finished”, “dnf”)

What to watch for:
- More noisy than robust times
- Useful for tracking completion rates

Race/explo_race_time_finished

Description: Exploration race times for runs that finished.

Interpretation:
- Time in seconds for exploration runs that completed the track
- Includes exploration, so more variable than evaluation times
- Higher than evaluation times (exploration slows agent down)

What to watch for:
- Should trend downward but be more noisy
- Useful for tracking exploration progress
- Large gap with eval times indicates exploration is working

Race/explo_race_time_{status}_{map}

Description: Exploration race time for specific map and status.

Interpretation:
- Time in seconds for exploration runs
- Includes all exploration runs
- More variable due to exploration

What to watch for:
- More noisy than finished times
- Useful for understanding exploration behavior

Race/eval_race_finished_{status}_{map}

Description: Whether evaluation race finished (1.0) or not (0.0).

Interpretation:
- Binary metric: 1.0 = finished, 0.0 = did not finish
- Shows completion rate for evaluation runs
- Should approach 1.0 as agent learns

What to watch for:
- Should increase to 1.0 as training progresses
- Persistent 0.0 values indicate agent is stuck

Race/explo_race_finished_{status}_{map}

Description: Whether exploration race finished (1.0) or not (0.0).

Interpretation:
- Binary metric: 1.0 = finished, 0.0 = did not finish
- Shows completion rate for exploration runs
- May be lower than eval completion rate

What to watch for:
- Should increase over time
- Lower than eval rate is normal (exploration can cause crashes)

Race/race_time_ratio_{map}

Description: Ratio of race time to total rollout duration.

Interpretation:
- Shows efficiency: how much of rollout time was spent racing
- Values < 1.0 indicate time spent on loading, setup, etc.
- Higher values = more efficient data collection

What to watch for:
- Should be relatively stable
- Very low values indicate system overhead issues

Race/split_{map}_{i}

Description: Time (in seconds) between checkpoint i and checkpoint i+1.

Interpretation:
- Shows performance on specific track segments
- Useful for identifying which parts of track are slow
- Only logged for evaluation runs

What to watch for:
- Should decrease over time for all splits
- Large differences between splits indicate difficult sections
- Useful for track-specific analysis

Race/eval_ratio_{status}_{reference}_{map}

Description: Race time as percentage of reference time (author or gold).

Interpretation:
- 100% = matched reference time
- <100% = faster than reference (rare, indicates very good performance)
- >100% = slower than reference
- Useful for comparing to human performance

What to watch for:
- Should decrease over time (approaching 100% or below)
- Only available if reference times are configured

Race/eval_agg_ratio_{status}_{reference}

Description: Aggregated ratio across all maps.

Interpretation:
- Average ratio across all maps with reference times
- Useful for multi-map training

What to watch for:
- Should decrease over time
- Only available if reference times are configured
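The per-map ratio and its aggregate can be written out as simple arithmetic. The sketch below assumes the aggregate is a plain unweighted mean over maps that have a reference time; the function names, map names, and times are hypothetical.

```python
def ratio_percent(race_time_s, reference_time_s):
    """Race time as a percentage of the reference time (100 = matched)."""
    return 100.0 * race_time_s / reference_time_s


def aggregate_ratio(race_times, reference_times):
    """Unweighted mean ratio over maps that have a reference time."""
    ratios = [
        ratio_percent(race_times[m], reference_times[m])
        for m in race_times
        if m in reference_times
    ]
    return sum(ratios) / len(ratios)


race_times = {"map_a": 66.0, "map_b": 45.0, "map_c": 80.0}
reference_times = {"map_a": 60.0, "map_b": 50.0}  # map_c has no reference
single = ratio_percent(66.0, 60.0)                  # 110: 10% slower than reference
overall = aggregate_ratio(race_times, reference_times)
```

Here map_a is 10% slower and map_b is 10% faster than its reference, so the aggregate lands at 100%.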

Gradient Metrics

Gradients/norm_median

Description: Median gradient norm after clipping.

Interpretation:
- Should be stable (typically <30)
- Shows typical gradient magnitude
- Stable values indicate healthy training

What to watch for:
- Should remain relatively constant
- Sudden changes may indicate learning issues

Gradients/norm_q1, norm_q3

Description: 25th and 75th percentile gradient norms after clipping.

Interpretation:
- Shows distribution of gradient magnitudes
- Q1-Q3 range shows typical gradient spread
- Useful for understanding gradient stability

What to watch for:
- Should be relatively stable
- Large spread may indicate unstable gradients

Gradients/norm_d9, norm_d98

Description: 90th and 98th percentile gradient norms after clipping.

Interpretation:
- Shows tail of gradient distribution
- Higher percentiles reveal occasional large gradients
- Useful for detecting outliers

What to watch for:
- Should be stable
- Large values may indicate occasional gradient spikes
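To make the percentile names concrete, the sketch below summarizes a collection of gradient norms into the same statistics (q1, median, q3, d9, d98, max). Linear interpolation between neighbouring values is an assumption; the training code may compute its percentiles differently, and the function names are hypothetical.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = (len(sorted_values) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarize_norms(norms):
    """Return the percentile statistics used in the Gradients/ group."""
    values = sorted(norms)
    return {
        "q1": percentile(values, 25),
        "median": percentile(values, 50),
        "q3": percentile(values, 75),
        "d9": percentile(values, 90),
        "d98": percentile(values, 98),
        "max": values[-1],
    }


# Norms 1, 2, ..., 101 make the percentiles easy to verify by hand.
summary = summarize_norms(range(1, 102))
```

Comparing `d98` and `max` against `median` is a quick way to see whether large gradients are rare outliers or part of the typical spread.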

Gradients/norm_max

Description: Maximum gradient norm after clipping.

Interpretation:
- Maximum gradient magnitude encountered
- After clipping, should be bounded by clip value
- Typical range: 10-50

What to watch for:
- Should be relatively stable
- Consistently hitting clip value may indicate need for higher clip threshold

Gradients/norm_before_clip_median

Description: Median gradient norm BEFORE clipping.

Interpretation:
- Shows typical gradient magnitude before clipping
- Should be similar to after-clip median if clipping is not active
- Useful for understanding if clipping is necessary

What to watch for:
- Should be stable
- Much higher than after-clip indicates clipping is active
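The relationship between before-clip and after-clip norms is easiest to see with a small norm-clipping example. This sketch assumes clipping by global L2 norm, which is one common approach; this document does not state which clipping method the training code uses, and the function name is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so its L2 norm is at most `max_norm`.

    Returns (clipped_grads, norm_before_clip).
    """
    norm_before = math.sqrt(sum(g * g for g in grads))
    if norm_before <= max_norm:
        # Clipping is inactive: before-clip and after-clip norms are identical.
        return list(grads), norm_before
    scale = max_norm / norm_before
    return [g * scale for g in grads], norm_before


clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=10.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

In this example the norm before clipping is 50 and the norm after clipping is 10, so a large gap between the two logged medians means the clip threshold is being hit regularly.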

Gradients/norm_before_clip_max

Description: CRITICAL METRIC! Maximum gradient norm BEFORE clipping.

Interpretation:
- Watch this closely! Values >100 indicate gradient explosions
- Should typically be <50
- Sudden spikes indicate training instability
- Used to detect gradient explosion before clipping fixes it

What to watch for:
- Most important gradient metric
- Values >100 = gradient explosion (bad!)
- Values >200 = severe gradient explosion
- Sudden spikes require investigation
- Should be relatively stable
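The thresholds above map directly onto a small classifier. The function name and the "elevated" label for the 50-100 band are hypothetical additions for this example; only the <50, >100, and >200 boundaries come from this document.

```python
def classify_max_grad_norm(norm_before_clip_max):
    """Map a Gradients/norm_before_clip_max reading to a severity label."""
    if norm_before_clip_max > 200:
        return "severe gradient explosion"
    if norm_before_clip_max > 100:
        return "gradient explosion"
    if norm_before_clip_max >= 50:
        # Not named in this document; above the typical range but below 100.
        return "elevated"
    return "typical"


labels = [classify_max_grad_norm(v) for v in (12.0, 75.0, 150.0, 450.0)]
```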

Gradients/norm_before_clip_q1, q3, d9, d98

Description: Percentile gradient norms before clipping.

Interpretation:
- Shows distribution of unclipped gradients
- Useful for understanding gradient behavior before clipping
- Similar interpretation to after-clip percentiles

What to watch for:
- Should be stable
- Large values indicate need for gradient clipping

Gradients/by_layer/{layer_name}/L2_median, q3, d9, max

Description: Per-layer L2 gradient norms (Euclidean norm).

Interpretation:
- Shows gradient magnitude for each network layer
- Useful for debugging which layers have gradient issues
- L2 norm = sqrt(sum of squared gradients)

What to watch for:
- Some layers may have naturally larger gradients
- Sudden spikes in specific layers indicate layer-specific issues
- Useful for identifying problematic layers

Gradients/by_layer/{layer_name}/Linf_median, q3, d9, max

Description: Per-layer Linf gradient norms (maximum absolute value).

Interpretation:
- Shows maximum gradient component for each layer
- Useful for detecting individual parameter issues
- Linf norm = max absolute gradient value

What to watch for:
- Can reveal issues in specific parameters
- Large Linf with small L2 indicates sparse large gradients
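The difference between the two norms is clearest on a toy example. Below, a dense gradient and a sparse gradient are constructed to have the same L2 norm but very different Linf norms; the helper names and values are illustrative only.

```python
import math


def l2_norm(values):
    """Euclidean norm: sqrt of the sum of squared entries."""
    return math.sqrt(sum(v * v for v in values))


def linf_norm(values):
    """Maximum absolute entry."""
    return max(abs(v) for v in values)


dense = [0.1] * 100           # many small entries
sparse = [0.0] * 99 + [1.0]   # a single large entry

dense_l2, dense_linf = l2_norm(dense), linf_norm(dense)
sparse_l2, sparse_linf = l2_norm(sparse), linf_norm(sparse)
```

Both vectors have an L2 norm of about 1.0, but the sparse one has an Linf norm ten times larger, so the ratio Linf / L2 is a useful signal that a few parameters carry most of the gradient.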

Performance Metrics

Performance/transitions_learned_per_second

Description: Training throughput - number of transitions processed per second.

Interpretation:
- Higher is better (faster training)
- Typical range: 100-1000 transitions/second
- Affected by GPU speed, batch size, and system performance

What to watch for:
- Should be relatively stable
- Sudden decreases may indicate system issues
- Higher values = faster training progress

Performance/learner_percentage_training

Description: Percentage of time learner process spends on training (vs waiting).

Interpretation:
- Should be high (>70%) for efficient training
- Low values indicate learner is waiting for data
- High values indicate good data collection rate

What to watch for:
- Should be >70% for efficient training
- <50% indicates workers are too slow
- 100% indicates perfect balance (rare)

Performance/learner_percentage_waiting_for_workers

Description: Percentage of time learner process waits for worker data.

Interpretation:
- Should be low (<20%) for efficient training
- High values indicate workers are too slow
- Indicates data collection bottleneck

What to watch for:
- Should be <20% for efficient training
- >50% indicates severe data collection bottleneck
- May need more worker instances or faster workers

Performance/learner_percentage_testing

Description: Percentage of time spent on test batches.

Interpretation:
- Typically small (<10%)
- Time spent evaluating on test buffer
- Useful for monitoring but not critical

What to watch for:
- Should be relatively small
- Large values may indicate too much testing
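The three learner percentages can be thought of as shares of the learner's wall-clock time. The sketch below assumes training, waiting, and testing are the only categories and therefore sum to 100%; whether the real instrumentation partitions time exactly this way is not stated in this document, and the function name is hypothetical.

```python
def learner_percentages(training_s, waiting_s, testing_s):
    """Convert per-category durations (seconds) into percentages of the total."""
    total = training_s + waiting_s + testing_s
    if total <= 0:
        raise ValueError("total duration must be positive")
    return {
        "training": 100.0 * training_s / total,
        "waiting_for_workers": 100.0 * waiting_s / total,
        "testing": 100.0 * testing_s / total,
    }


shares = learner_percentages(training_s=75.0, waiting_s=20.0, testing_s=5.0)
# Flag a data collection bottleneck using the >50% rule of thumb above.
bottleneck = shares["waiting_for_workers"] > 50.0
```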

Performance/instrumentation__answer_normal_step

Description: Time spent in normal step processing (microseconds).

Interpretation:
- Low-level performance metric
- Shows TMInterface communication overhead
- Useful for debugging performance issues

What to watch for:
- Should be relatively stable
- Sudden increases may indicate system issues

Performance/instrumentation__answer_action_step

Description: Time spent in action step processing (microseconds).

Interpretation:
- Low-level performance metric
- Shows action processing time
- Useful for debugging performance issues

What to watch for:
- Should be relatively stable
- Affects overall training speed

Performance/instrumentation__between_run_steps

Description: Time spent between runs (microseconds).

Interpretation:
- Low-level performance metric
- Shows overhead between race restarts
- Includes map loading, reset, etc.

What to watch for:
- Should be relatively stable
- Large values indicate slow map loading

Performance/instrumentation__grab_frame

Description: Time spent grabbing frame from game (microseconds).

Interpretation:
- Low-level performance metric
- Shows frame capture overhead
- Affected by game rendering speed

What to watch for:
- Should be relatively stable
- Large values may indicate rendering issues

Performance/instrumentation__convert_frame

Description: Time spent converting frame format (microseconds).

Interpretation:
- Low-level performance metric
- Shows image processing overhead
- Affected by image resolution and format

What to watch for:
- Should be relatively stable
- Can be optimized by reducing resolution

Performance/instrumentation__grab_floats

Description: Time spent grabbing float data from game (microseconds).

Interpretation:
- Low-level performance metric
- Shows data extraction overhead
- Includes speed, position, etc.

What to watch for:
- Should be relatively stable
- Typically very fast

Performance/instrumentation__exploration_policy

Description: Time spent in exploration policy computation (microseconds).

Interpretation:
- Low-level performance metric
- Shows action selection overhead
- Includes Q-value computation and exploration

What to watch for:
- Should be relatively stable
- Affected by network inference speed

Performance/instrumentation__request_inputs_and_speed

Description: Time spent requesting inputs and speed from game (microseconds).

Interpretation:
- Low-level performance metric
- Shows game communication overhead
- Includes TMInterface API calls

What to watch for:
- Should be relatively stable
- Large values may indicate communication issues

Performance/tmi_protection_cutoff

Description: Number of times TMI protection cutoff was triggered.

Interpretation:
- Safety mechanism to prevent infinite loops
- High values indicate agent is getting stuck frequently
- Should be low for well-trained agent

What to watch for:
- Should decrease as agent learns
- High values indicate learning issues
- May need to adjust timeout settings

Performance/worker_time_in_rollout_percentage

Description: Percentage of rollout time spent in worker processing.

Interpretation:
- Shows worker efficiency
- Higher values = workers are busy (good)
- Lower values = workers are waiting (bad)

What to watch for:
- Should be relatively high (>80%)
- Low values indicate worker bottlenecks

Buffer Metrics

Buffer/size

Description: Current number of transitions in replay buffer.

Interpretation:
- Grows from 0 to max_size during training
- More transitions = more diverse training data
- Typical range: 20K to 200K

What to watch for:
- Should increase until reaching max_size
- Should remain at max_size once full
- Sudden decreases may indicate buffer issues

Buffer/max_size

Description: Maximum capacity of replay buffer.

Interpretation:
- Set by memory_size_schedule
- Larger buffers = more memory but better diversity
- Typical range: 50K to 200K

What to watch for:
- Should remain constant except at the points where memory_size_schedule changes it

Buffer/number_times_single_memory_is_used_before_discard

Description: How many times each transition is used before being discarded.

Interpretation:
- Controls transition reuse
- Higher values = transitions used more times
- Balances data efficiency with freshness

What to watch for:
- Should remain constant unless explicitly changed
- Typical values: 1-4

Buffer/priorities_min, q1, mean, median, q3, d9, c98, max

Description: Priority statistics for prioritized experience replay.

Interpretation:
- Only available if using prioritized replay (prio_alpha > 0)
- Higher priorities = more important transitions
- Priorities based on TD error
- Shows distribution of transition importance

What to watch for:
- Large spread indicates some transitions are much more important
- Should be relatively stable
- Not available if using uniform sampling (prio_alpha = 0)
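The role of prio_alpha can be illustrated with the standard proportional prioritization rule, in which a transition is sampled with probability proportional to (|TD error| + epsilon) raised to the power alpha. This is a generic sketch of that rule; the exact formula and the small `epsilon` constant used by the training code are assumptions, and the function name is hypothetical.

```python
def sampling_probabilities(td_errors, prio_alpha, epsilon=1e-6):
    """Sampling probability of each transition under proportional prioritization."""
    # epsilon keeps transitions with zero TD error from never being sampled.
    scaled = [(abs(e) + epsilon) ** prio_alpha for e in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


# prio_alpha = 0 reduces to uniform sampling regardless of TD error.
uniform = sampling_probabilities([0.1, 1.0, 5.0], prio_alpha=0.0)
# prio_alpha = 1 samples in direct proportion to |TD error|.
skewed = sampling_probabilities([1.0, 3.0], prio_alpha=1.0)
```

This is why the priority statistics carry no information when prio_alpha is 0: every transition has the same scaled priority.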

Network Metrics

Network/weights/{layer_name}/L2

Description: L2 norm (Euclidean norm) of layer weights.

Interpretation:
- Shows magnitude of weights in each layer
- Useful for detecting weight growth or decay
- Should be relatively stable during training

What to watch for:
- Sudden increases may indicate instability
- Gradual growth is normal
- Very large values may indicate numerical issues

Network/optimizer/{layer_name}/adaptive_lr_L2

Description: L2 norm of per-parameter adaptive learning rates (Adam/RAdam).

Interpretation:
- Shows magnitude of adaptive learning rates
- Adam/RAdam adjust learning rate per parameter
- Higher values = larger effective learning rates

What to watch for:
- Should be relatively stable
- Useful for understanding optimizer behavior

Network/optimizer/{layer_name}/exp_avg_L2

Description: L2 norm of first moment estimate (Adam/RAdam).

Interpretation:
- First moment (moving average of gradients)
- Used by Adam/RAdam for momentum
- Should track gradient magnitudes

What to watch for:
- Should be relatively stable
- Useful for debugging optimizer state

Network/optimizer/{layer_name}/exp_avg_sq_L2

Description: L2 norm of second moment estimate (Adam/RAdam).

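Interpretation:
- Second moment (moving average of squared gradients)
- Used by Adam/RAdam for adaptive learning rates
- Should track the typical squared gradient magnitude (an uncentered estimate, so it matches the gradient variance only when the mean gradient is near zero)

What to watch for:
- Should be relatively stable
- Useful for debugging optimizer state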
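The three optimizer quantities fit together in a single Adam update. The sketch below follows the standard Adam rule for one scalar parameter, using the names `exp_avg` and `exp_avg_sq` for the two moment estimates. How the training code derives the logged adaptive_lr_L2 value from these is an assumption, and RAdam's additional variance rectification is not shown.

```python
import math


def adam_step(grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; `step` counts from 1.

    Returns (update, exp_avg, exp_avg_sq, adaptive_lr).
    """
    # First moment: moving average of gradients.
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    # Second moment: moving average of squared gradients.
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization.
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    # Per-parameter effective learning rate.
    adaptive_lr = lr / (math.sqrt(v_hat) + eps)
    return adaptive_lr * m_hat, exp_avg, exp_avg_sq, adaptive_lr


update, m, v, adaptive_lr = adam_step(grad=2.0, exp_avg=0.0, exp_avg_sq=0.0, step=1)
```

On the first step the update size is close to `lr` whatever the gradient scale, because the bias-corrected moments cancel; parameters with consistently large gradients end up with a smaller adaptive learning rate.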

IQN Metrics

IQN/quantile_std_action_{i}

Description: Standard deviation of quantile predictions for action {i}.

Interpretation:
- Measures uncertainty in Q-value estimates for each action
- Higher values = more uncertainty (wider distribution)
- Lower values = more confidence (narrower distribution)
- IQN-specific metric (distributional RL)

What to watch for:
- Should decrease as agent learns (becomes more confident)
- High values indicate high uncertainty
- Useful for understanding model confidence
- Different actions may have different uncertainty levels
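As an illustration, the per-action spread can be computed from a table of quantile predictions with one row per sampled quantile and one column per action. The layout of the table, the use of the population standard deviation, and the function name are assumptions made for this sketch.

```python
import statistics


def quantile_std_per_action(quantile_values):
    """Standard deviation across quantile samples, one value per action.

    `quantile_values[k][a]` is the predicted return for quantile sample k
    and action a.
    """
    n_actions = len(quantile_values[0])
    return [
        statistics.pstdev([row[a] for row in quantile_values])
        for a in range(n_actions)
    ]


# Action 0: all quantiles agree. Action 1: quantiles spread from 5 to 9.
quantiles = [
    [1.0, 5.0],
    [1.0, 7.0],
    [1.0, 9.0],
]
stds = quantile_std_per_action(quantiles)
```

A value of zero for an action means every sampled quantile predicts the same return, while a larger value means the predicted return distribution for that action is wider.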

Other Metrics

alltime_min_ms_{map}

Description: All-time best race time (in milliseconds) for each map.

Interpretation:
- Best time ever achieved on each map
- Only decreases (new records)
- Most important performance metric alongside Race/eval_race_time_robust

What to watch for:
- Should decrease over time (new records)
- Plateaus indicate agent has reached current limit
- Compare with reference times if available

cumul_number_frames_played

Description: Cumulative number of frames processed during training.

Interpretation:
- Total training progress
- Used as x-axis in most TensorBoard plots
- Typical training: 1M to 50M+ frames

What to watch for:
- Should increase steadily
- Used to track training progress

cumul_number_batches_done

Description: Cumulative number of training batches processed.

Interpretation:
- Total number of gradient updates
- Related to cumul_number_frames_played but depends on buffer fill rate
- Higher = more learning steps

What to watch for:
- Should increase steadily
- Ratio to cumul_number_frames_played shows learning frequency

cumul_number_single_memories_used

Description: Cumulative number of transitions used for training.

Interpretation:
- Total transitions sampled from buffer
- May be higher than cumul_number_frames_played due to reuse
- Shows total learning experience

What to watch for:
- Should increase steadily
- Higher than cumul_number_frames_played indicates transition reuse

cumul_number_memories_generated

Description: Cumulative number of transitions generated from rollouts.

Interpretation:
- Total transitions added to buffer
- Includes n-step transitions
- Shows data collection progress

What to watch for:
- Should increase steadily
- Should be less than cumul_number_single_memories_used when transitions are reused more than once

cumul_training_hours

Description: Cumulative training time in hours.

Interpretation:
- Total wall-clock time spent training
- Useful for estimating training duration
- Includes all overhead (not just GPU time)

What to watch for:
- Should increase steadily
- Useful for planning training schedules

cumul_number_target_network_updates

Description: Cumulative number of target network updates.

Interpretation:
- Number of times target network was updated
- Target network updated less frequently than online network
- Used for stable Q-learning

What to watch for:
- Should increase steadily
- Frequency depends on update schedule

times_summary (Text)

Description: Text summary of best times for all maps.

Interpretation:
- Human-readable summary of performance
- Shows best times with timestamps
- Updated every 5 minutes

What to watch for:
- Useful for quick overview
- Shows new records with ** markers

Tips for Using TensorBoard

  1. Filtering: Use the search box in TensorBoard to filter metrics by prefix (e.g., type “Gradients/” to see all gradient metrics)

  2. Custom Scalars: The “Custom Scalars” tab has pre-configured layouts for key metrics grouped together

  3. Smoothing: Use the smoothing slider to reduce noise in plots (helpful for noisy metrics)

  4. Comparison: Load multiple runs to compare different training configurations

  5. Key Metrics to Monitor:
     - Race/eval_race_time_robust - Primary performance metric
     - RL/avg_Q - Learning progress indicator
     - Gradients/norm_before_clip_max - Training stability
     - Training/loss - Learning quality
     - Performance/transitions_learned_per_second - Training efficiency

  6. Early Training (0-3M frames):
     - Watch RL/single_zone_reached - should increase to 1.0
     - Watch RL/avg_Q - may decrease then increase
     - Watch Training/loss - may increase (normal!)

  7. Mid Training (3-10M frames):
     - Watch Race/eval_race_time_robust - should decrease
     - Watch RL/avg_Q - should increase
     - Watch Training/loss - should stabilize

  8. Late Training (10M+ frames):
     - Watch Race/eval_race_time_robust - slow improvements
     - Watch for plateaus - may need longer training or hyperparameter changes
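Logged scalars can also be read programmatically rather than through the TensorBoard UI. The sketch below uses the `EventAccumulator` class from the `tensorboard` package to load one tag, then a plain helper to find the first step at which a metric drops below a threshold. The log directory path, the helper names, and the 60-second threshold are hypothetical; the import is done inside the function so that the second helper works without TensorBoard installed.

```python
def load_scalar(logdir, tag):
    """Return (steps, values) for one scalar tag from a TensorBoard log directory."""
    # Imported lazily: only this function needs the tensorboard package.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    accumulator = EventAccumulator(logdir)
    accumulator.Reload()  # Parse the event files on disk.
    events = accumulator.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]


def first_step_below(steps, values, threshold):
    """Return the first step whose value is below `threshold`, or None."""
    for step, value in zip(steps, values):
        if value < threshold:
            return step
    return None


# Example with made-up data; with real logs you might instead call
# steps, values = load_scalar("path/to/run", "Race/eval_race_time_robust")
steps = [100_000, 200_000, 300_000, 400_000]
values = [75.2, 64.8, 59.9, 58.7]
reached = first_step_below(steps, values, threshold=60.0)
```

This kind of query is handy when comparing runs, for example to see which configuration first reaches a target race time.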