============
Main Objects
============

In this page, we list the main objects used throughout the code.

rollout_results
---------------

``rollout_results`` is a dictionary designed to hold all information collected during a rollout (synonym for "a race").

This dictionary is created within the ``GameInstanceManager.rollout()`` function, then placed by ``trackmania_rl.multiprocess.collector_process.collector_process_fn()`` in a ``multiprocessing.Queue`` so that it can be read by ``trackmania_rl.multiprocess.learner_process.learner_process_fn()``.

Within the learner process, ``rollout_results`` is passed to ``buffer_management.fill_buffer_from_rollout_with_n_steps_rule()`` to fill a ``ReplayBuffer``. After this, ``rollout_results`` can be discarded.

.. code-block:: python

    rollout_results = {
        "current_zone_idx": [],
        "frames": [],
        "input_w": [],
        "actions": [],
        "action_was_greedy": [],
        "car_gear_and_wheels": [],
        "q_values": [],
        "meters_advanced_along_centerline": [],
        "state_float": [],
        "furthest_zone_idx": 0,
    }

buffer and buffer_test
----------------------

``buffer`` and ``buffer_test`` are created in ``trackmania_rl.buffer_utilities.make_buffers()`` and used exclusively within the learner process.

They are basic ``ReplayBuffer`` objects from the torchrl library, designed to hold the transitions used to train the agent. The buffer's behavior is customized with ``buffer_utilities.buffer_collate_function()`` to implement "mini-races" during sampling: a way to re-interpret states as being part of a "mini-race" instead of the full trajectory along the racetrack. This trick masks the consequences of actions beyond a given horizon, allows us to optimise with ``gamma = 1``, and generally simplifies the learning process for the agent.

By default, ``buffer`` contains 95% of transitions and is used to train the agent. ``buffer_test`` contains the remaining 5% of transitions and is used as a hidden test set to monitor the agent's tendency to overfit its memory.
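The 95% / 5% split between ``buffer`` and ``buffer_test`` can be illustrated with a minimal sketch. It assumes each transition is routed at random with a fixed test ratio; the project may instead route whole rollouts at once. The function ``route_transitions`` and the plain Python lists standing in for ``ReplayBuffer`` objects are hypothetical and only serve the illustration.

```python
import random


def route_transitions(transitions, test_ratio=0.05, rng=None):
    """Split transitions between a training list and a held-out test list.

    Hypothetical helper: plain lists stand in for the two ReplayBuffer objects.
    """
    rng = rng or random.Random()
    train, test = [], []
    for transition in transitions:
        # Each transition lands in the test set with probability test_ratio.
        target = test if rng.random() < test_ratio else train
        target.append(transition)
    return train, test


# Integers stand in for transitions; a seeded generator keeps the run repeatable.
train, test = route_transitions(range(10_000), test_ratio=0.05, rng=random.Random(0))
print(len(train), len(test))
```

Because the held-out transitions are never sampled for training, a growing gap between the loss on the two sets is a sign that the agent is memorising its replay buffer.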

Experience
----------

The class ``Experience`` defined in ``trackmania_rl/experience_replay/`` defines the way a transition is stored in memory.

.. code-block:: python

    """
    (state_img, state_float):
        represent "state", ubiquitous in reinforcement learning
        state_img is a np.array of shape (1, H, W) and dtype np.uint8
        state_float is a np.array of shape (config.float_input_dim, ) and dtype np.float32

    (next_state_img, next_state_float):
        represent "next_state"
        next_state_img is a np.array of shape (1, H, W) and dtype np.uint8
        next_state_float is a np.array of shape (config.float_input_dim, ) and dtype np.float32

    (state_potential and next_state_potential):
        floats, used for reward shaping as per Andrew Ng's paper:
        https://people.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf

    action:
        an integer representing the action taken for this transition,
        mapped to config_files/inputs_list.py

    terminal_actions:
        an integer representing the number of steps between "state" and race finish
        in the rollout from which this transition was extracted.
        If the rollout did not finish (i.e. early cutoff), then contains math.inf

    n_steps:
        How many steps were taken between "state" and "next state".
        Not all transitions contain the same value, as this may depend on exploration policy.
        Note that in buffer_collate_function, a transition may be reinterpreted as terminal
        with a lower n_steps, depending on the random horizon that was sampled.

    gammas:
        a numpy array of shape (config.n_steps, ) containing the gamma value if steps = 0, 1, 2, etc...

    rewards:
        a numpy array of shape (config.n_steps, ) containing the reward value if steps = 0, 1, 2, etc...

    The structure of these transitions is unusual. It comes from our "mini-race" logic,
    which will be documented separately. This is how we are able to define Q-values as
    "the sum of expected rewards obtained during the next 7 seconds", and how we can
    optimise with gamma = 1.
    """

IQN_Network
-----------

Implemented in ``trackmania_rl.agents.iqn``, the ``IQN_Network`` class inherits from ``torch.nn.Module``. It holds the weights that parameterize the IQN agent's policy, and defines the neural network's structure.

Multiple instances of the ``IQN_Network`` class coexist within the code:

- Each collector process possesses an ``inference_network``, with JIT compilation enabled by default.
- The learner process possesses an ``online_network`` and a ``target_network``, with JIT compilation enabled by default.

These instances **do not share weights**: they are independent instances.

The learner process and collector processes have access to a common ``uncompiled_shared_network`` created in ``scripts/train.py``. The learner regularly copies weights from the ``online_network`` to the ``uncompiled_shared_network``. Collector processes regularly copy weights from the ``uncompiled_shared_network`` to their own ``inference_network``. Locks are used to avoid simultaneous writing to and reading from the ``uncompiled_shared_network``.

The network's structure is further defined in the class's ``forward()`` method. A detailed architecture description with block diagrams is in :doc:`experiments/models/iqn_architecture`.