TensorBoard Metrics Reference
This document provides a comprehensive guide to all metrics logged to TensorBoard during training. Metrics are organized into groups for easier navigation.
Overview
All metrics are logged with prefixes that group them into categories:
- Training/ - Training process metrics
- RL/ - Reinforcement learning metrics
- Race/ - Race performance metrics
- Gradients/ - Gradient monitoring
- Performance/ - System performance metrics
- Buffer/ - Replay buffer statistics
- Network/ - Neural network weights and optimizer state
- IQN/ - IQN-specific metrics
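The grouping comes purely from the `Prefix/` part of each scalar tag, so any TensorBoard writer reproduces it. A minimal sketch (the tag names match this document; the log directory and values are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # illustrative log dir

# TensorBoard groups scalars by the "Prefix/" part of the tag name.
writer.add_scalar("Training/loss", 0.42, global_step=1000)
writer.add_scalar("RL/avg_Q", -1.3, global_step=1000)
writer.add_scalar("Race/eval_race_time_robust", 51.87, global_step=1000)
writer.close()
```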
Training Metrics
- Training/loss
Description: Training loss computed on batches from the replay buffer.
Interpretation: - In reinforcement learning, loss increasing early in training is normal and expected - This indicates the agent is discovering the environment and identifying inconsistencies in its value estimates - Loss should stabilize or decrease after ~1-2M frames - Values typically range from 0.01 to 10.0
What to watch for: - Sudden spikes (>100) may indicate gradient explosions - Consistently increasing loss after 5M+ frames may indicate learning issues
- Training/loss_test
Description: Test loss computed on held-out test buffer (not used for training).
Interpretation: - Should track training loss but be slightly higher - Large gap between training and test loss indicates overfitting - Useful for detecting when the model is memorizing rather than generalizing
What to watch for: - Test loss much higher than training loss (>2x) suggests overfitting - Test loss decreasing while training loss increases suggests good generalization
- Training/learning_rate
Description: Current learning rate used by the optimizer.
Interpretation: - Decays according to the learning rate schedule - Typical range: 1e-5 to 1e-3 - Lower learning rates in later training allow fine-tuning
What to watch for: - Should decrease smoothly over time - Abrupt changes indicate schedule issues
- Training/weight_decay
Description: L2 regularization strength (weight decay coefficient).
Interpretation: - Prevents overfitting by penalizing large weights - Typically proportional to learning rate - Range: 1e-7 to 1e-5
What to watch for: - Should track learning rate if using proportional weight decay - Too high values can prevent learning
- Training/batch_size
Description: Number of transitions sampled per training batch.
Interpretation: - Larger batches provide more stable gradients but slower updates - Typical values: 32, 64, 128, 256
What to watch for: - Should remain constant unless explicitly changed in config
- Training/n_steps
Description: N-step return horizon for bootstrapping.
Interpretation: - Number of steps used in n-step returns - Higher values reduce bias but increase variance - Typical range: 1-5
What to watch for: - Should remain constant unless explicitly changed in config
- Training/discard_non_greedy_actions_in_nsteps
Description: Whether non-greedy (exploratory) actions are excluded from n-step returns.
Interpretation: - 1.0 = True (only greedy actions in n-step backup) - 0.0 = False (all actions included) - Recommended: True to reduce exploration bias
What to watch for: - Should remain constant unless explicitly changed in config
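A minimal sketch of the n-step backup that these two settings control. The callable `q_bootstrap`, and the exact truncation convention for non-greedy actions, are illustrative assumptions, not the project's own code:

```python
def n_step_return(rewards, was_greedy, q_bootstrap, gamma, n_steps):
    """Illustrative n-step target: discounted rewards plus a bootstrapped tail.

    rewards[k]     : reward collected k steps after the stored transition
    was_greedy[k]  : whether the (k+1)-th action along the way was greedy;
                     with discard_non_greedy_actions_in_nsteps the backup
                     is truncated at the first exploratory action
    q_bootstrap(h) : max_a Q_target(s_{t+h}, a)
    """
    horizon = n_steps
    for k in range(n_steps - 1):
        if not was_greedy[k]:
            horizon = k + 1  # stop bootstrapping past the exploratory action
            break
    g = sum(gamma**k * rewards[k] for k in range(horizon))
    return g + gamma**horizon * q_bootstrap(horizon)
```

With n_steps = 1 this reduces to the ordinary one-step TD target.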
- Training/train_on_batch_duration
Description: Median time (in seconds) to process one training batch.
Interpretation: - Lower is better (faster training) - Typical range: 0.01-0.1 seconds - Affected by GPU speed, batch size, and network complexity
What to watch for: - Sudden increases may indicate GPU throttling or system issues - Should be relatively stable
RL Metrics
- RL/avg_Q
Description: Average Q-value (expected future reward) predicted by the network.
Interpretation: - Key indicator of learning progress - Starts near zero for untrained agent - Initially decreases as agent discovers it plays poorly - Should increase as agent learns better strategies - Higher values indicate agent expects more reward
What to watch for: - Should trend upward after initial exploration phase (~500K-1M frames) - Plateaus indicate agent has learned current strategy - Decreasing values may indicate learning instability
- RL/single_zone_reached
Description: Furthest virtual checkpoint (zone) reached during a race, as percentage of track.
Interpretation: - 0.0 = agent didn’t start - 1.0 = agent finished the track - Shows how far agent progresses along the track
What to watch for: - Should increase over time - Takes ~300K steps to learn to press forward - Takes ~500K steps to finish map for first time - Takes ~1M steps to regularly finish map - Plateaus indicate agent is stuck at certain sections
- RL/gamma
Description: Discount factor for future rewards.
Interpretation: - Controls how much future rewards are valued - Range: 0.0 (only immediate reward) to 1.0 (all future rewards equally) - Typically increases from 0.999 to 1.0 during training - Higher values make agent plan further ahead
What to watch for: - Should increase according to schedule - Too low values make agent short-sighted - Too high values (1.0) can cause instability
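For example, with gamma = 0.999 a reward 1,000 steps ahead is discounted by a factor of 0.999^1000 ≈ 0.37, while with gamma = 1.0 it counts in full; this is why raising gamma makes the agent plan further ahead.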
- RL/epsilon
Description: Epsilon-greedy exploration rate.
Interpretation: - Probability of taking random action instead of greedy action - Decays from 1.0 (fully random) to ~0.03 (mostly greedy) - Higher values = more exploration - Lower values = more exploitation
What to watch for: - Should decay smoothly according to schedule - Too fast decay = insufficient exploration - Too slow decay = agent doesn’t exploit learned strategies
- RL/epsilon_boltzmann
Description: Probability of selecting an action by Boltzmann (softmax) sampling rather than greedily.
Interpretation: - With this probability, the action is drawn from a softmax over Q-values instead of taken greedily - Higher values = more exploration - Used in combination with epsilon-greedy
What to watch for: - Should decay according to schedule - Works together with epsilon for exploration strategy
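A sketch of one plausible way these knobs combine, together with the `tau_epsilon_boltzmann` temperature described in the next entry (the mixing rule here is an illustrative assumption; the actual policy in the codebase may differ):

```python
import numpy as np

def select_action(q_values, epsilon, epsilon_boltzmann, tau, rng):
    """Mixed exploration policy (illustrative sketch)."""
    u = rng.random()
    if u < epsilon:                          # uniform random action
        return int(rng.integers(len(q_values)))
    if u < epsilon + epsilon_boltzmann:      # Boltzmann: softmax over Q at temperature tau
        z = (np.asarray(q_values) - np.max(q_values)) / tau  # stabilized logits
        p = np.exp(z) / np.exp(z).sum()
        return int(rng.choice(len(q_values), p=p))
    return int(np.argmax(q_values))          # greedy action
```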
- RL/tau_epsilon_boltzmann
Description: Tau parameter for Boltzmann exploration.
Interpretation: - Temperature of the Boltzmann softmax over Q-values - Higher tau = flatter action distribution (more exploration) - Typically a constant value
What to watch for: - Should remain constant unless explicitly changed
- RL/mean_action_gap
Description: Average difference between best Q-value and other Q-values per state.
Interpretation: - Measures how confident the agent is in its action selection - Higher values = agent has clear preference for one action - Lower values = agent is uncertain between actions - Negative values are possible (computed as negative gap)
What to watch for: - Should increase as agent learns (becomes more confident) - Very low values indicate high uncertainty
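One way this statistic could be computed from a batch of Q-values (illustrative; the negation at the end matches the sign convention mentioned above):

```python
import torch

def mean_action_gap(q):
    """q: [batch, n_actions]. Mean gap between the best and the other Q-values."""
    best = q.max(dim=1, keepdim=True).values
    gaps = best - q                                  # zero at the best action itself
    per_state = gaps.sum(dim=1) / (q.shape[1] - 1)   # average over non-best actions
    return -per_state.mean()                         # negated, hence negative logged values
```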
- RL/q_value_{i}_starting_frame
Description: Q-value for action {i} at the starting frame of a race.
Interpretation: - Shows agent’s expected reward for each action at race start - Useful for understanding initial action preferences - Typically logged for action 0 (forward)
What to watch for: - Should increase as agent learns - Can reveal if agent has learned good starting strategy
Race Metrics
- Race/eval_race_time_robust
Description: Most important performance metric! Best evaluation race times (greedy policy, no exploration).
Interpretation: - Time in seconds for evaluation runs that finished within 2% of rolling mean - Only includes “robust” runs (consistent performance) - Lower is better - This is the primary metric to track for agent performance
What to watch for: - Should decrease over time (agent getting faster) - Plateaus indicate agent has learned current strategy - Compare with reference times (author/gold) if available - Most reliable indicator of actual performance
- Race/eval_race_time_{status}_{map}
Description: Evaluation race time for specific map and status.
Interpretation: - Time in seconds for evaluation runs - Includes all evaluation runs (not just robust ones) - More variable than robust times - Status indicates run quality (e.g., “finished”, “dnf”)
What to watch for: - More noisy than robust times - Useful for tracking completion rates
- Race/explo_race_time_finished
Description: Exploration race times for runs that finished.
Interpretation: - Time in seconds for exploration runs that completed the track - Includes exploration, so more variable than evaluation times - Higher than evaluation times (exploration slows agent down)
What to watch for: - Should trend downward but be more noisy - Useful for tracking exploration progress - Large gap with eval times indicates exploration is working
- Race/explo_race_time_{status}_{map}
Description: Exploration race time for specific map and status.
Interpretation: - Time in seconds for exploration runs - Includes all exploration runs - More variable due to exploration
What to watch for: - More noisy than finished times - Useful for understanding exploration behavior
- Race/eval_race_finished_{status}_{map}
Description: Whether evaluation race finished (1.0) or not (0.0).
Interpretation: - Binary metric: 1.0 = finished, 0.0 = did not finish - Shows completion rate for evaluation runs - Should approach 1.0 as agent learns
What to watch for: - Should increase to 1.0 as training progresses - Persistent 0.0 values indicate agent is stuck
- Race/explo_race_finished_{status}_{map}
Description: Whether exploration race finished (1.0) or not (0.0).
Interpretation: - Binary metric: 1.0 = finished, 0.0 = did not finish - Shows completion rate for exploration runs - May be lower than eval completion rate
What to watch for: - Should increase over time - Lower than eval rate is normal (exploration can cause crashes)
- Race/race_time_ratio_{map}
Description: Ratio of race time to total rollout duration.
Interpretation: - Shows efficiency: how much of rollout time was spent racing - Values < 1.0 indicate time spent on loading, setup, etc. - Higher values = more efficient data collection
What to watch for: - Should be relatively stable - Very low values indicate system overhead issues
- Race/split_{map}_{i}
Description: Time (in seconds) between checkpoint i and checkpoint i+1.
Interpretation: - Shows performance on specific track segments - Useful for identifying which parts of track are slow - Only logged for evaluation runs
What to watch for: - Should decrease over time for all splits - Large differences between splits indicate difficult sections - Useful for track-specific analysis
- Race/eval_ratio_{status}_{reference}_{map}
Description: Race time as percentage of reference time (author or gold).
Interpretation: - 100% = matched reference time - <100% = faster than reference (rare, indicates very good performance) - >100% = slower than reference - Useful for comparing to human performance
What to watch for: - Should decrease over time (approaching 100% or below) - Only available if reference times are configured
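For example, an evaluation run of 48.0 s on a map with a 45.0 s author time logs a ratio of 48.0/45.0 ≈ 106.7%.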
- Race/eval_agg_ratio_{status}_{reference}
Description: Aggregated ratio across all maps.
Interpretation: - Average ratio across all maps with reference times - Useful for multi-map training
What to watch for: - Should decrease over time - Only available if reference times are configured
Gradient Metrics
- Gradients/norm_median
Description: Median gradient norm after clipping.
Interpretation: - Should be stable (typically <30) - Shows typical gradient magnitude - Stable values indicate healthy training
What to watch for: - Should remain relatively constant - Sudden changes may indicate learning issues
- Gradients/norm_q1, norm_q3
Description: 25th and 75th percentile gradient norms after clipping.
Interpretation: - Shows distribution of gradient magnitudes - Q1-Q3 range shows typical gradient spread - Useful for understanding gradient stability
What to watch for: - Should be relatively stable - Large spread may indicate unstable gradients
- Gradients/norm_d9, norm_d98
Description: 90th and 98th percentile gradient norms after clipping.
Interpretation: - Shows tail of gradient distribution - Higher percentiles reveal occasional large gradients - Useful for detecting outliers
What to watch for: - Should be stable - Large values may indicate occasional gradient spikes
- Gradients/norm_max
Description: Maximum gradient norm after clipping.
Interpretation: - Maximum gradient magnitude encountered - After clipping, should be bounded by clip value - Typical range: 10-50
What to watch for: - Should be relatively stable - Consistently hitting clip value may indicate need for higher clip threshold
- Gradients/norm_before_clip_median
Description: Median gradient norm BEFORE clipping.
Interpretation: - Shows typical gradient magnitude before clipping - Should be similar to after-clip median if clipping is not active - Useful for understanding if clipping is necessary
What to watch for: - Should be stable - Much higher than after-clip indicates clipping is active
- Gradients/norm_before_clip_max
Description: CRITICAL METRIC! Maximum gradient norm BEFORE clipping.
Interpretation: - Watch this closely! Values >100 indicate gradient explosions - Should typically be <50 - Sudden spikes indicate training instability - Used to detect gradient explosion before clipping fixes it
What to watch for: - Most important gradient metric - Values >100 = gradient explosion (bad!) - Values >200 = severe gradient explosion - Sudden spikes require investigation - Should be relatively stable
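In PyTorch this metric is cheap to obtain, because `torch.nn.utils.clip_grad_norm_` returns the total norm measured before clipping. A minimal sketch (the model, optimizer, and thresholds are illustrative):

```python
import torch

def train_step(model, optimizer, loss, max_norm=30.0):
    """Backward, clip, step; returns the pre-clip gradient norm (illustrative)."""
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the total norm
    # measured BEFORE clipping, i.e. the Gradients/norm_before_clip_* quantity
    pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if pre_clip > 100.0:  # the explosion threshold suggested above
        print(f"warning: pre-clip gradient norm {float(pre_clip):.1f}")
    optimizer.step()
    return float(pre_clip)
```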
- Gradients/norm_before_clip_q1, q3, d9, d98
Description: Percentile gradient norms before clipping.
Interpretation: - Shows distribution of unclipped gradients - Useful for understanding gradient behavior before clipping - Similar interpretation to after-clip percentiles
What to watch for: - Should be stable - Large values indicate need for gradient clipping
- Gradients/by_layer/{layer_name}/L2_median, q3, d9, max
Description: Per-layer L2 gradient norms (Euclidean norm).
Interpretation: - Shows gradient magnitude for each network layer - Useful for debugging which layers have gradient issues - L2 norm = sqrt(sum of squared gradients)
What to watch for: - Some layers may have naturally larger gradients - Sudden spikes in specific layers indicate layer-specific issues - Useful for identifying problematic layers
- Gradients/by_layer/{layer_name}/Linf_median, q3, d9, max
Description: Per-layer Linf gradient norms (maximum absolute value).
Interpretation: - Shows maximum gradient component for each layer - Useful for detecting individual parameter issues - Linf norm = max absolute gradient value
What to watch for: - Can reveal issues in specific parameters - Large Linf with small L2 indicates sparse large gradients
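A sketch of how such per-layer norms can be gathered, treating each named parameter tensor as a “layer” (illustrative helper, not the project's own logging code):

```python
import torch

def per_layer_grad_norms(model):
    """Per-parameter-tensor L2 and Linf gradient norms (illustrative)."""
    stats = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        stats[name] = {
            "L2": g.norm(2).item(),        # sqrt of the sum of squared gradients
            "Linf": g.abs().max().item(),  # largest single gradient component
        }
    return stats
```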
Performance Metrics
- Performance/transitions_learned_per_second
Description: Training throughput - number of transitions processed per second.
Interpretation: - Higher is better (faster training) - Typical range: 100-1000 transitions/second - Affected by GPU speed, batch size, and system performance
What to watch for: - Should be relatively stable - Sudden decreases may indicate system issues - Higher values = faster training progress
- Performance/learner_percentage_training
Description: Percentage of time learner process spends on training (vs waiting).
Interpretation: - Should be high (>70%) for efficient training - Low values indicate learner is waiting for data - High values indicate good data collection rate
What to watch for: - Should be >70% for efficient training - <50% indicates workers are too slow - 100% indicates perfect balance (rare)
- Performance/learner_percentage_waiting_for_workers
Description: Percentage of time learner process waits for worker data.
Interpretation: - Should be low (<20%) for efficient training - High values indicate workers are too slow - Indicates data collection bottleneck
What to watch for: - Should be <20% for efficient training - >50% indicates severe data collection bottleneck - May need more worker instances or faster workers
- Performance/learner_percentage_testing
Description: Percentage of time spent on test batches.
Interpretation: - Typically small (<10%) - Time spent evaluating on test buffer - Useful for monitoring but not critical
What to watch for: - Should be relatively small - Large values may indicate too much testing
- Performance/instrumentation__answer_normal_step
Description: Time spent in normal step processing (microseconds).
Interpretation: - Low-level performance metric - Shows TMInterface communication overhead - Useful for debugging performance issues
What to watch for: - Should be relatively stable - Sudden increases may indicate system issues
- Performance/instrumentation__answer_action_step
Description: Time spent in action step processing (microseconds).
Interpretation: - Low-level performance metric - Shows action processing time - Useful for debugging performance issues
What to watch for: - Should be relatively stable - Affects overall training speed
- Performance/instrumentation__between_run_steps
Description: Time spent between runs (microseconds).
Interpretation: - Low-level performance metric - Shows overhead between race restarts - Includes map loading, reset, etc.
What to watch for: - Should be relatively stable - Large values indicate slow map loading
- Performance/instrumentation__grab_frame
Description: Time spent grabbing frame from game (microseconds).
Interpretation: - Low-level performance metric - Shows frame capture overhead - Affected by game rendering speed
What to watch for: - Should be relatively stable - Large values may indicate rendering issues
- Performance/instrumentation__convert_frame
Description: Time spent converting frame format (microseconds).
Interpretation: - Low-level performance metric - Shows image processing overhead - Affected by image resolution and format
What to watch for: - Should be relatively stable - Can be optimized by reducing resolution
- Performance/instrumentation__grab_floats
Description: Time spent grabbing float data from game (microseconds).
Interpretation: - Low-level performance metric - Shows data extraction overhead - Includes speed, position, etc.
What to watch for: - Should be relatively stable - Typically very fast
- Performance/instrumentation__exploration_policy
Description: Time spent in exploration policy computation (microseconds).
Interpretation: - Low-level performance metric - Shows action selection overhead - Includes Q-value computation and exploration
What to watch for: - Should be relatively stable - Affected by network inference speed
- Performance/instrumentation__request_inputs_and_speed
Description: Time spent requesting inputs and speed from game (microseconds).
Interpretation: - Low-level performance metric - Shows game communication overhead - Includes TMInterface API calls
What to watch for: - Should be relatively stable - Large values may indicate communication issues
- Performance/tmi_protection_cutoff
Description: Number of times TMI protection cutoff was triggered.
Interpretation: - Safety mechanism to prevent infinite loops - High values indicate agent is getting stuck frequently - Should be low for well-trained agent
What to watch for: - Should decrease as agent learns - High values indicate learning issues - May need to adjust timeout settings
- Performance/worker_time_in_rollout_percentage
Description: Percentage of rollout time spent in worker processing.
Interpretation: - Shows worker efficiency - Higher values = workers are busy (good) - Lower values = workers are waiting (bad)
What to watch for: - Should be relatively high (>80%) - Low values indicate worker bottlenecks
Buffer Metrics
- Buffer/size
Description: Current number of transitions in replay buffer.
Interpretation: - Grows from 0 to max_size during training - More transitions = more diverse training data - Typical range: 20K to 200K
What to watch for: - Should increase until reaching max_size - Should remain at max_size once full - Sudden decreases may indicate buffer issues
- Buffer/max_size
Description: Maximum capacity of replay buffer.
Interpretation: - Set by memory_size_schedule - Larger buffers = more memory but better diversity - Typical range: 50K to 200K
What to watch for: - Should remain constant unless schedule changes - Changes according to memory_size_schedule
- Buffer/number_times_single_memory_is_used_before_discard
Description: How many times each transition is used before being discarded.
Interpretation: - Controls transition reuse - Higher values = transitions used more times - Balances data efficiency with freshness
What to watch for: - Should remain constant unless explicitly changed - Typical values: 1-4
- Buffer/priorities_min, q1, mean, median, q3, d9, d98, max
Description: Priority statistics for prioritized experience replay.
Interpretation: - Only available if using prioritized replay (prio_alpha > 0) - Higher priorities = more important transitions - Priorities based on TD error - Shows distribution of transition importance
What to watch for: - Large spread indicates some transitions are much more important - Should be relatively stable - Not available if using uniform sampling (prio_alpha = 0)
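A minimal sketch of proportional prioritization consistent with the description above (the epsilon constant is an illustrative stabilizer, not necessarily the project's value):

```python
import numpy as np

def priorities_from_td_error(td_errors, prio_alpha, eps=1e-6):
    """Proportional prioritization (illustrative): p_i = (|delta_i| + eps)^alpha.

    With prio_alpha = 0 every priority equals 1, i.e. uniform sampling,
    which is why the Buffer/priorities_* metrics are absent in that case.
    """
    return (np.abs(td_errors) + eps) ** prio_alpha

def sampling_probs(priorities):
    """Probability of sampling each transition, proportional to its priority."""
    return priorities / priorities.sum()
```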
Network Metrics
- Network/weights/{layer_name}/L2
Description: L2 norm (Euclidean norm) of layer weights.
Interpretation: - Shows magnitude of weights in each layer - Useful for detecting weight growth or decay - Should be relatively stable during training
What to watch for: - Sudden increases may indicate instability - Gradual growth is normal - Very large values may indicate numerical issues
- Network/optimizer/{layer_name}/adaptive_lr_L2
Description: L2 norm of per-parameter adaptive learning rates (Adam/RAdam).
Interpretation: - Shows magnitude of adaptive learning rates - Adam/RAdam adjust learning rate per parameter - Higher values = larger effective learning rates
What to watch for: - Should be relatively stable - Useful for understanding optimizer behavior
- Network/optimizer/{layer_name}/exp_avg_L2
Description: L2 norm of first moment estimate (Adam/RAdam).
Interpretation: - First moment (moving average of gradients) - Used by Adam/RAdam for momentum - Should track gradient magnitudes
What to watch for: - Should be relatively stable - Useful for debugging optimizer state
- Network/optimizer/{layer_name}/exp_avg_sq_L2
Description: L2 norm of second moment estimate (Adam/RAdam).
Interpretation: - Second moment (moving average of squared gradients) - Used by Adam/RAdam for adaptive learning rates - Should track gradient variance
What to watch for: - Should be relatively stable - Useful for debugging optimizer state
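A sketch of how these weight and moment norms can be read from a PyTorch Adam/RAdam optimizer (`exp_avg` and `exp_avg_sq` are the real state keys; the helper itself is illustrative):

```python
import torch

def network_state_norms(model, optimizer):
    """L2 norms of weights and Adam/RAdam moment estimates per layer (illustrative)."""
    stats = {}
    for name, p in model.named_parameters():
        state = optimizer.state.get(p, {})
        entry = {"weights_L2": p.detach().norm(2).item()}
        if "exp_avg" in state:       # first moment: momentum-like gradient average
            entry["exp_avg_L2"] = state["exp_avg"].norm(2).item()
        if "exp_avg_sq" in state:    # second moment: drives the adaptive step size
            entry["exp_avg_sq_L2"] = state["exp_avg_sq"].norm(2).item()
        stats[name] = entry
    return stats
```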
IQN Metrics
- IQN/quantile_std_action_{i}
Description: Standard deviation of quantile predictions for action {i}.
Interpretation: - Measures uncertainty in Q-value estimates for each action - Higher values = more uncertainty (wider distribution) - Lower values = more confidence (narrower distribution) - IQN-specific metric (distributional RL)
What to watch for: - Should decrease as agent learns (becomes more confident) - High values indicate high uncertainty - Useful for understanding model confidence - Different actions may have different uncertainty levels
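A minimal sketch of the statistic, assuming a tensor of quantile samples per action (the shape convention is illustrative):

```python
import torch

def quantile_std_per_action(quantile_values):
    """quantile_values: [n_quantiles, n_actions] samples of Z(s, a).

    The mean over quantiles is the usual Q-value; the std measures how
    spread out the predicted return distribution is for each action.
    """
    return quantile_values.std(dim=0)  # one scalar per action
```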
Other Metrics
- alltime_min_ms_{map}
Description: All-time best race time (in milliseconds) for each map.
Interpretation: - Best time ever achieved on each map - Only decreases (new records) - Most important performance metric alongside eval_race_time_robust
What to watch for: - Should decrease over time (new records) - Plateaus indicate agent has reached current limit - Compare with reference times if available
- cumul_number_frames_played
Description: Cumulative number of frames processed during training.
Interpretation: - Total training progress - Used as x-axis in most TensorBoard plots - Typical training: 1M to 50M+ frames
What to watch for: - Should increase steadily - Used to track training progress
- cumul_number_batches_done
Description: Cumulative number of training batches processed.
Interpretation: - Total number of gradient updates - Related to frames_played but depends on buffer fill rate - Higher = more learning steps
What to watch for: - Should increase steadily - Ratio to frames_played shows learning frequency
- cumul_number_single_memories_used
Description: Cumulative number of transitions used for training.
Interpretation: - Total transitions sampled from buffer - May be higher than frames_played due to reuse - Shows total learning experience
What to watch for: - Should increase steadily - Higher than frames_played indicates transition reuse
- cumul_number_memories_generated
Description: Cumulative number of transitions generated from rollouts.
Interpretation: - Total transitions added to buffer - Includes n-step transitions - Shows data collection progress
What to watch for: - Should increase steadily - Should be less than memories_used (due to reuse)
- cumul_training_hours
Description: Cumulative training time in hours.
Interpretation: - Increases only while the learner training loop is running (not calendar time across restarts) - Includes overhead while the loop runs (not just GPU kernel time) - Useful for estimating training duration and aligning analysis with the console “Training hours” output
What to watch for: - Should increase steadily - Useful for planning training schedules
- cumul_number_target_network_updates
Description: Cumulative number of target network updates.
Interpretation: - Number of times target network was updated - Target network updated less frequently than online network - Used for stable Q-learning
What to watch for: - Should increase steadily - Frequency depends on update schedule
- times_summary (Text)
Description: Text summary of best times for all maps.
Interpretation: - Human-readable summary of performance - Shows best times with timestamps - Updated every 5 minutes
What to watch for: - Useful for quick overview - Shows new records with ** markers
Merged log folders and time axes (analysis scripts)
Training is often split across several TensorBoard directories (suffix schedule: run, run_2, …).
Analysis tools merge them using a single time origin: the earliest wall_time in any of those
directories. The resulting relative wall minutes are therefore calendar time from that first
event, including nights and idle periods when the learner was not running. That axis is not
the same as “how many hours the network trained.”
For curves and checkpoints that should track actual training progress, use either:
- The scalar `cumul_training_hours` (also stored in `accumulated_stats.joblib` under the same key), e.g. `python scripts/analyze_experiment_by_relative_time.py ... --time-axis cumul_training_hours`; or
- BY STEP tables (checkpoints in environment steps), which are unaffected by wall-clock gaps.
Final bests and step-based summaries taken from the end of a run remain meaningful; what goes wrong is only the interpretation of “at X wall minutes” when the timeline spans long pauses or merged chunks.
To see whether wall span and `cumul_training_hours` disagree for a run, use `python scripts/audit_tensorboard_training_timeline.py [--runs RUN …]`.
Tips for Using TensorBoard
- Filtering: Use the search box in TensorBoard to filter metrics by prefix (e.g., type “Gradients/” to see all gradient metrics)
- Custom Scalars: The “Custom Scalars” tab has pre-configured layouts for key metrics grouped together
- Smoothing: Use the smoothing slider to reduce noise in plots (helpful for noisy metrics)
- Comparison: Load multiple runs to compare different training configurations
- Key Metrics to Monitor:
  - `Race/eval_race_time_robust` - Primary performance metric
  - `RL/avg_Q` - Learning progress indicator
  - `Gradients/norm_before_clip_max` - Training stability
  - `Training/loss` - Learning quality
  - `Performance/transitions_learned_per_second` - Training efficiency
- Early Training (0-3M frames):
  - Watch `RL/single_zone_reached` - should increase to 1.0
  - Watch `RL/avg_Q` - may decrease then increase
  - Watch `Training/loss` - may increase (normal!)
- Mid Training (3-10M frames):
  - Watch `Race/eval_race_time_robust` - should decrease
  - Watch `RL/avg_Q` - should increase
  - Watch `Training/loss` - should stabilize
- Late Training (10M+ frames):
  - Watch `Race/eval_race_time_robust` - slow improvements
  - Watch for plateaus - may need longer training or hyperparameter changes