Learning#
This package contains the implementations of the learning algorithms.
Algorithms Available#
The algorithms documented on this page, all built on top of BaseAlgorithm, are:
- EvolutionStrategy
- QLearning
- DeepQLearning
- DDPG (with optional TD3 improvements)
- PPO
Usage#
All the algorithms are implemented as classes, which can be used as follows:
from learning import Algorithm
from agents import Agent
env_kwargs = {'id': 'CartPole-v1'}
agent_kwargs = {'hidden_sizes': [64, 64]}
algorithm = Algorithm(env_kwargs, agent_kwargs)
algorithm.train()
mean, std = algorithm.test()
algorithm.save_plots()
algorithm.save_videos()
BaseAlgorithm#
The learning.base_algorithm.BaseAlgorithm class defines the base interface an algorithm must satisfy to be usable by our implementations.
For an algorithm to be usable, it must implement the following methods:
- class learning.base_algorithm.BaseAlgorithm(env_kwargs, num_envs, max_episode_length=-1, max_total_reward=-1, save_folder='results', normalize_observation=False, seed=42, envs_wrappers=None)#
This class is the base class for all algorithms.
Once implemented, an algorithm can be used as follows:
import gymnasium
from rlib.learning import DeepQLearning

env_kwargs = {"id": "CartPole-v1"}
agent_kwargs = {"hidden_sizes": [200, 200]}

model = DeepQLearning(env_kwargs, agent_kwargs)
model.train()
model.test()
- Variables:
env_kwargs (dict) – The kwargs for calling gym.make(**env_kwargs, render_mode=render_mode).
num_envs (int) – The number of environments to use for training.
max_episode_length (int) – Maximum number of steps taken to complete an episode.
max_total_reward (float) – Maximum reward achievable in one episode
save_folder (str) – The path of the folder where to save the results
videos_folder (str) – The path of the folder where to save the videos
models_folder (str) – The path of the folder where to save the models
plots_folder (str) – The path of the folder where to save the plots
current_agent – The current agent used by the algorithm
normalize_observation (bool) – Whether to normalize the observation in [-1, 1]
envs_wrappers (list, optional) – The wrappers to use for the environment, by default None.
- __init__(env_kwargs, num_envs, max_episode_length=-1, max_total_reward=-1, save_folder='results', normalize_observation=False, seed=42, envs_wrappers=None)#
Base class for all the algorithms.
- Parameters:
env_kwargs (dict) – The kwargs for calling gym.make(**env_kwargs, render_mode=render_mode).
num_envs (int) – The number of environments to use for training.
max_episode_length (int, optional) – Maximum number of steps taken to complete an episode. Default is -1 (no limit)
max_total_reward (float, optional) – Maximum reward achievable in one episode. Default is -1 (no limit)
save_folder (str, optional) – The path of the folder where to save the results. Default is “results”
normalize_observation (bool, optional) – Whether to normalize the observation in [-1, 1]. Default is False
seed (int, optional) – The seed to use for the environment. Default is 42
envs_wrappers (list, optional) – The wrappers to use for the environment, by default None.
- train()#
Default training method.
Along with the training, it creates the folders for saving the results, saves the hyperparameters and the git info.
- abstract classmethod train_() → None#
Train the agent on the environment.
This method should be implemented in the child class.
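For instance, a minimal skeleton of a child class could look as follows. This is only a sketch: the class name MyAlgorithm is hypothetical, the import path of BaseAlgorithm may differ, and the save/load bodies show just one possible choice.
import torch
from rlib.learning.base_algorithm import BaseAlgorithm  # import path may differ

class MyAlgorithm(BaseAlgorithm):
    """Skeleton of a child class; only the abstract methods are filled in."""

    def train_(self):
        # The actual training loop goes here; self.current_agent and the
        # environment built from self.env_kwargs are available on the instance.
        pass

    def save(self, path):
        torch.save(self.current_agent, path)

    def load(self, path):
        self.current_agent = torch.load(path)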
- test(num_episodes=1, display=False, save_video=False, video_path=None, seed=None)#
Test the current agent on the environment.
- Parameters:
num_episodes (int, optional) – The number of episodes to test the agent on, by default 1.
display (bool, optional) – Whether to display the game, by default False.
save_video (bool, optional) – Whether to save a video of the game, by default False.
video_path (str, optional) – The path to save the video to, by default None.
seed (int, optional) – The seed to use for the environment, by default None.
- Returns:
The mean reward obtained over the episodes, and the standard deviation of the reward obtained over the episodes.
- Return type:
float, float
- Raises:
ValueError – If num_episodes is not strictly positive.
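For example, continuing the generic usage snippet above, a test run on an already trained algorithm instance could look like this (the file name is arbitrary):
# Evaluate the trained agent over 10 episodes.
mean, std = algorithm.test(num_episodes=10, seed=0)
print(f"Mean reward: {mean:.2f} +/- {std:.2f}")

# Optionally record a video of a single episode.
algorithm.test(num_episodes=1, save_video=True, video_path="test_run.mp4")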
- abstract classmethod save(path)#
Save the current agent to the given path.
- Parameters:
path (str) – The path to save the agent to.
- abstract classmethod load(path)#
Load the agent from the given path.
- Parameters:
path (str) – The path to load the agent from.
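A typical save/load round trip might look like this (the path is arbitrary and only illustrative):
# Persist the current agent and restore it later.
algorithm.save("results/models/best_agent.pt")
algorithm.load("results/models/best_agent.pt")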
- save_plots()#
Save the plots of the training.
The plots are saved in the folder plots_folder.
- save_videos()#
Saves videos of the models saved at testing iterations.
The videos are saved in the folder save_folder.
EvolutionStrategy#
- class learning.evolution_strategy.EvolutionStrategy(env_kwargs, agent_kwargs, num_agents=30, num_iterations=300, lr=0.03, sigma=0.1, test_every=50, num_test_episodes=5, max_episode_length=-1, max_total_reward=-1, save_folder='evolution_strategy', stop_max_score=False, verbose=True, normalize_observation=False, seed=42)#
Implementation of the Evolution Strategy algorithm.
This algorithm does not need gradient computation; it is therefore compatible with any agent. However, for simplicity, PyTorch networks are used here.
The update rule of the weights is given by:
\[\theta_{t+1} = \theta_t + \frac{1}{N \sigma} \sum_{i=1}^N r_i \epsilon_i\]where \(\theta_t\) are the weights at iteration \(t\), \(N\) is the number of agents, \(\sigma\) is the noise standard deviation, and \(r_i\) is the reward obtained by the agent with weights \(\theta_t + \sigma\epsilon_i\).
Each \(\epsilon_i\) is sampled from a normal distribution with mean 0 and standard deviation 1.
Examples
from rlib.learning import EvolutionStrategy

env_kwargs = {'id': 'CartPole-v0'}
agent_kwargs = {'hidden_sizes': [32, 32]}

model = EvolutionStrategy(env_kwargs, agent_kwargs, num_agents=30, num_iterations=300)
model.train()
model.save_plots()
model.save_videos()
- __init__(env_kwargs, agent_kwargs, num_agents=30, num_iterations=300, lr=0.03, sigma=0.1, test_every=50, num_test_episodes=5, max_episode_length=-1, max_total_reward=-1, save_folder='evolution_strategy', stop_max_score=False, verbose=True, normalize_observation=False, seed=42)#
Initialize the Evolution Strategy algorithm.
- Parameters:
env_kwargs (dict) – The kwargs for calling gym.make(**env_kwargs, render_mode=render_mode).
agent_kwargs (dict) – Kwargs used to call get_agent(kwargs=agent_kwargs); some parameters are automatically inferred (input sizes, MLP or CNN, …).
num_agents (int, optional) – The number of agents to use to compute the gradient, by default 30
num_iterations (int, optional) – The number of iterations to run the algorithm, by default 300
lr (float, optional) – The learning rate, by default 0.03
sigma (float, optional) – The noise standard deviation, by default 0.1
test_every (int, optional) – The number of iterations between each test, by default 50
num_test_episodes (int, optional) – The number of episodes to play during each test, by default 5
max_episode_length (int, optional) – The maximum number of steps to play in an episode, by default -1 (no limit)
max_total_reward (int, optional) – The maximum total reward to get in an episode, by default -1 (no limit)
save_folder (str, optional) – The folder where to save the models at each test step, by default “evolution_strategy”
stop_max_score (bool, optional) – Whether to stop the training when the maximum score is reached on a test run, by default False
verbose (bool, optional) – Whether to display a progression bar during training, by default True
normalize_observation (bool, optional) – Whether to normalize the observation, by default False
seed (int, optional) – The seed to use for the environment, by default 42
- _get_random_parameters()#
Returns randomly generated parameters sampled from a normal distribution.
- Returns:
The randomly generated parameters.
- Return type:
dict[str, torch.Tensor]
- _parameters_update(params, test_rewards, test_noise, lr, sigma)#
Computes the new parameters of the agent.
The new parameters are given by the formula in EvolutionStrategy.
- Parameters:
params (dict[str, torch.Tensor]) – The current parameters of the agent.
test_rewards (list[float]) – The rewards obtained by the agents with the current parameters and the noise.
test_noise (dict[str, torch.Tensor]) – The noise used to compute the gradient.
lr (float) – The learning rate.
sigma (float) – The noise standard deviation.
- Returns:
The new parameters of the agent.
- Return type:
dict[str, torch.Tensor]
- _get_test_parameters(params, sigma, noise)#
Given the current parameters, the noise standard deviation and the noise, returns the parameters to test.
- Parameters:
params (dict[str, torch.Tensor]) – The current parameters of the agent.
sigma (float) – The noise standard deviation.
noise (dict[str, torch.Tensor]) – The noise to add to the parameters.
- Returns:
The parameters to test.
- Return type:
dict[str, torch.Tensor]
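As an illustration only, the three helpers above roughly correspond to the following sketch; the function and variable names are hypothetical, evaluate is an assumed callback that maps a parameter dict to an episode reward, and the real implementation may differ (here the learning rate scales the update).
import torch

def es_step(params, num_agents, sigma, lr, evaluate):
    """One Evolution Strategy update for a dict of parameter tensors."""
    noises, rewards = [], []
    for _ in range(num_agents):
        # _get_random_parameters: noise ~ N(0, 1) with the same shapes as params
        noise = {k: torch.randn_like(v) for k, v in params.items()}
        # _get_test_parameters: theta + sigma * epsilon
        test_params = {k: params[k] + sigma * noise[k] for k in params}
        noises.append(noise)
        rewards.append(evaluate(test_params))

    # _parameters_update: theta <- theta + lr / (N * sigma) * sum_i r_i * eps_i
    rewards = torch.tensor(rewards)
    new_params = {}
    for k, v in params.items():
        grad_estimate = sum(r * n[k] for r, n in zip(rewards, noises))
        new_params[k] = v + lr / (num_agents * sigma) * grad_estimate
    return new_params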
QLearning#
- class learning.q_learning.QLearning(env_kwargs, agent_kwargs, max_episode_length=-1, max_total_reward=-1, save_folder='qlearning', num_iterations=1000, lr=0.03, discount=0.99, epsilon_greedy=0.9, epsilon_decay=0.9999, epsilon_min=0.01, test_every=10, num_test_episodes=5, verbose=True, seed=42)#
Applies the QLearning algorithm to the environment.
The Q-Table is updated using the following formula:
\[Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)\]where \(\alpha\) is the learning rate and \(\gamma\) the discount factor.
An epsilon greedy policy is used to select the actions.
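A rough NumPy sketch of this tabular update and of the epsilon-greedy action selection, assuming a 2-D table indexed by a discretized state and an action (the library's own Q-table agent may store things differently):
import numpy as np

def q_update(q_table, state, action, reward, next_state, lr, discount):
    """Tabular update: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + discount * np.max(q_table[next_state])
    q_table[state, action] += lr * (td_target - q_table[state, action])


def epsilon_greedy_action(q_table, state, epsilon):
    """Random action with probability epsilon, greedy action otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(q_table.shape[1])
    return int(np.argmax(q_table[state]))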
Example:
import gymnasium
from rlib.learning import QLearning

env_kwargs = {'id': 'MountainCar-v0'}
agent_kwargs = {'grid_size': 20}

model = QLearning(env_kwargs, agent_kwargs, lr=0.03, discount=0.99,
                  epsilon_greedy=0.1, epsilon_decay=0.9999, epsilon_min=0.01)
model.train()
model.save_plots()
model.save_videos()
- __init__(env_kwargs, agent_kwargs, max_episode_length=-1, max_total_reward=-1, save_folder='qlearning', num_iterations=1000, lr=0.03, discount=0.99, epsilon_greedy=0.9, epsilon_decay=0.9999, epsilon_min=0.01, test_every=10, num_test_episodes=5, verbose=True, seed=42)#
Initialize the QLearning algorithm.
- Parameters:
env_kwargs (dict) – The kwargs for calling gym.make(**env_kwargs, render_mode=render_mode).
agent_kwargs (dict) – Kwargs to call get_agent(kwargs=agent_kwargs, q_table=True); the env_kwargs parameter can be omitted.
max_episode_length (int, optional) – The maximum length of an episode, by default -1 (no limit).
max_total_reward (float, optional) – The maximum total reward to get in the episode, by default -1 (no limit).
save_folder (str, optional) – The path of the folder where to save the results. Default is “qlearning”
num_iterations (int, optional) – The number of episodes to train. Default is 1000.
lr (float, optional) – The learning rate. Default is 0.03.
discount (float, optional) – The discount factor. Default is 0.99.
epsilon_greedy (float, optional) – The epsilon greedy parameter. Default is 0.9.
epsilon_decay (float, optional) – The epsilon decay parameter. Default is 0.9999.
epsilon_min (float, optional) – The minimum epsilon value. Default is 0.01.
test_every (int, optional) – The number of episodes between each save. Default is 10.
num_test_episodes (int, optional) – The number of episodes to test. Default is 5.
verbose (bool, optional) – Whether to print the results of each episode. Default is True.
seed (int, optional) – The seed for the environment. Default is 42.
Deep Q-Learning#
- class learning.deep_q_learning.DeepQLearning(env_kwargs, agent_kwargs, max_episode_length=-1, max_total_reward=-1, save_folder='deep_qlearning', lr=0.0003, discount=0.99, epsilon_start=0.1, epsilon_min=0.01, exploration_fraction=0.1, num_time_steps=100000, learning_starts=50000, update_every=4, number_updates=None, main_target_update=10, verbose=True, test_every=50000, num_test_episodes=10, batch_size=64, size_replay_buffer=100000, max_grad_norm=10, normalize_observation=False, stop_max_score=False, seed=42)#
Deep Q-Learning algorithm.
The Q-Table is replaced by a neural network that approximates the Q-Table.
The neural network is updated using the following formula:
\[Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)\]where \(\alpha\) is the learning rate and \(\gamma\) the discount factor.
Hence, the method is only suitable for discrete action spaces.
An epsilon greedy policy is used to select the actions, and the transitions are stored in a ReplayBuffer before being used to update the neural network.
Example:
from rlib.learning import DeepQLearning

env_kwargs = {'id': 'CartPole-v1'}
agent_kwargs = {'hidden_sizes': [10]}

model = DeepQLearning(
    env_kwargs, agent_kwargs,
    lr=0.03, discount=0.99,
    epsilon_start=0.1, epsilon_min=0.01
)
model.train()
model.test()
model.save_plots()
- __init__(env_kwargs, agent_kwargs, max_episode_length=-1, max_total_reward=-1, save_folder='deep_qlearning', lr=0.0003, discount=0.99, epsilon_start=0.1, epsilon_min=0.01, exploration_fraction=0.1, num_time_steps=100000, learning_starts=50000, update_every=4, number_updates=None, main_target_update=10, verbose=True, test_every=50000, num_test_episodes=10, batch_size=64, size_replay_buffer=100000, max_grad_norm=10, normalize_observation=False, stop_max_score=False, seed=42)#
Initializes the DeepQLearning algorithm.
- Parameters:
env_kwargs (dict) – The kwargs for calling gym.make(**env_kwargs, render_mode=render_mode).
agent_kwargs (dict) – The kwargs for calling get_agent(obs_space, action_space, **agent_kwargs).
max_episode_length (int, optional) – The maximum length of an episode, by default -1 (no limit).
max_total_reward (float, optional) – The maximum total reward to get in the episode, by default -1 (no limit).
save_folder (str, optional) – The folder where to save the model, by default “deep_qlearning”.
lr (float, optional) – The learning rate, by default 3e-4.
discount (float, optional) – The discount factor, by default 0.99.
epsilon_start (float, optional) – The probability to take a random action, by default 0.1.
epsilon_min (float, optional) – The minimum value of epsilon greedy, by default 0.01.
exploration_fraction (float, optional) – The fraction of the training time over which epsilon is decreased; with 0.1, epsilon_min is reached after 10% of the training time (see the sketch after this list). By default 0.1.
num_time_steps (int, optional) – The number of time steps to train the agent, by default 100_000.
learning_starts (int, optional) – The number of time steps before starting to train the agent, by default 50_000.
update_every (int, optional) – The number of time steps between each update of the neural network, by default 4.
number_updates (int, optional) – The number of updates to perform at each time step, by default set to update_every.
main_target_update (int, optional) – The number of time steps between each update of the target network, by default 10.
verbose (bool, optional) – Whether to print the results, by default True.
test_every (int, optional) – The number of time steps between each test, by default 50_000.
num_test_episodes (int, optional) – The number of episodes to test the agent, by default 10.
batch_size (int, optional) – The batch size, by default 64.
size_replay_buffer (int, optional) – The size of the replay buffer, by default 100_000.
max_grad_norm (int, optional) – The maximum norm of the gradients, by default 10.
normalize_observation (bool, optional) – Whether to normalize the observation in [-1, 1], by default False.
stop_max_score (bool, optional) – Whether to stop the training when the maximum score is reached, by default False.
seed (int, optional) – The seed for the environment, by default 42.
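The linear exploration schedule implied by epsilon_start, epsilon_min and exploration_fraction can be sketched as follows (the function name is hypothetical and the actual schedule may differ in details):
def linear_epsilon(step, num_time_steps, epsilon_start=0.1, epsilon_min=0.01,
                   exploration_fraction=0.1):
    """Linearly decay epsilon from epsilon_start to epsilon_min over the first
    exploration_fraction * num_time_steps steps, then keep it constant."""
    decay_steps = exploration_fraction * num_time_steps
    fraction = min(step / decay_steps, 1.0)
    return epsilon_start + fraction * (epsilon_min - epsilon_start)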
- _populate_replay_buffer(env)#
Populate the replay buffer with random samples from the environment.
This is done until the replay buffer is filled with learning_starts samples. Furthermore, the actions are sampled randomly with probability epsilon.
- Parameters:
env (gymnasium.ENV) – The environment to sample from.
- update_weights()#
Update the weights of the neural network.
From the replay_buffer, a batch of size batch_size is sampled and used to update the weights of the neural network by minimizing the temporal-difference error.
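Up to implementation details, this loss takes the standard form implied by the update rule above, with \(Q_{target}\) denoting the target network and the average taken over the sampled batch:
\[\mathcal{L} = \left( Q(s_t, a_t) - \left( r_{t+1} + \gamma \max_{a} Q_{target}(s_{t+1}, a) \right) \right)^2\]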
Deep Deterministic Policy Gradient#
- class learning.ddpg.DDPG(env_kwargs, mu_kwargs, q_kwargs, max_episode_length=-1, max_total_reward=-1, save_folder='ddpg', q_lr=0.0003, mu_lr=0.0003, lr_annealing=True, action_noise=0.1, target_noise=0.2, num_updates_per_iter=10, delay_policy_update=2, twin_q=True, discount=0.99, num_episodes=1000, learning_starts=50000, target_update_tau=0.01, verbose=True, test_every=10, num_test_episodes=10, batch_size=64, size_replay_buffer=100000, max_grad_norm=10, normalize_observation=False, use_norm_wrappers=True, seed=42)#
Implementation of the Deep Deterministic Policy Gradient algorithm with options to use the improvements from TD3.
Here, a Policy agent \(\mu(s)\) and a Q-function agent \(Q(s, a)\) are used. \(\mu(s)\) is trained to maximize \(Q(s, \mu(s))\) and \(Q(s, a)\) is trained to minimize \((Q(s, a) - (r + \gamma Q(s', \mu(s'))))^2\).
Because of the nature of the problem, including the fact that \(\mu(s)\) should be differentiable, only environments with continuous action spaces are supported.
For the TD3 improvements, two Q-functions are used, and the policy is updated less frequently.
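As an illustration of these two objectives (not the library's exact code: tensor names are hypothetical, and the twin-Q and target-noise details of TD3 are omitted):
import torch
import torch.nn.functional as F

def ddpg_losses(mu, q, mu_target, q_target, batch, discount):
    """Compute the critic and actor losses on a replay-buffer batch."""
    s, a, r, s_next, done = batch

    # Critic target: r + gamma * Q_target(s', mu_target(s')) for non-terminal s'.
    with torch.no_grad():
        target = r + discount * (1 - done) * q_target(s_next, mu_target(s_next))
    critic_loss = F.mse_loss(q(s, a), target)

    # Actor: maximize Q(s, mu(s)), i.e. minimize its negative.
    actor_loss = -q(s, mu(s)).mean()
    return critic_loss, actor_loss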
Example:
from rlib.learning import DDPG
import gymnasium as gym

env_kwargs = {'id': 'BipedalWalker-v3', 'hardcore': False}
mu_kwargs = {'hidden_sizes': [256, 256]}
q_kwargs = {'hidden_sizes': [256, 256]}

model = DDPG(
    env_kwargs, mu_kwargs, q_kwargs,
    max_episode_length=1600,
    max_total_reward=-1,
    save_folder="ddpg",
    q_lr=3e-4,
    mu_lr=3e-4,
    action_noise=0.1,
    target_noise=0.2,
    delay_policy_update=2,
    twin_q=True,
    discount=0.99,
    num_episodes=2_000,
    learning_starts=0,
    target_update_tau=0.005,
)
model.train()
model.test()
- __init__(env_kwargs, mu_kwargs, q_kwargs, max_episode_length=-1, max_total_reward=-1, save_folder='ddpg', q_lr=0.0003, mu_lr=0.0003, lr_annealing=True, action_noise=0.1, target_noise=0.2, num_updates_per_iter=10, delay_policy_update=2, twin_q=True, discount=0.99, num_episodes=1000, learning_starts=50000, target_update_tau=0.01, verbose=True, test_every=10, num_test_episodes=10, batch_size=64, size_replay_buffer=100000, max_grad_norm=10, normalize_observation=False, use_norm_wrappers=True, seed=42)#
Initializes the DDPG algorithm.
- Parameters:
env_kwargs (dict) – The kwargs for calling gym.make(**env_kwargs, render_mode=render_mode).
mu_kwargs (dict) – The kwargs for the policy agent.
q_kwargs (dict) – The kwargs for the Q-function agent.
max_episode_length (int, optional) – The maximum length of an episode, by default -1 (no limit).
max_total_reward (float, optional) – The maximum total reward to get in the episode, by default -1 (no limit).
save_folder (str, optional) – The folder where to save the models, plots and videos, by default “ddpg”.
q_lr (float, optional) – The learning rate for the Q-function agent, by default 3e-4.
mu_lr (float, optional) – The learning rate for the policy agent, by default 3e-4.
action_noise (float, optional) – The noise added during population of the replay buffer, by default 0.1.
target_noise (float, optional) – The noise added to target actions, by default 0.2.
num_updates_per_iter (int, optional) – The number of updates per iteration, by default 10.
delay_policy_update (int, optional) – The number of Q-function updates before updating the policy, by default 2.
twin_q (bool, optional) – Whether to use two Q-functions, by default True.
discount (float, optional) – The discount factor, by default 0.99.
num_episodes (int, optional) – The number of episodes to train, by default 1000.
learning_starts (int, optional) – The number of random samples in the replay buffer before training, by default 50_000.
target_update_tau (float, optional) – The soft-update coefficient \(\tau\): the fraction of the main network weights blended into the target network at each update, by default 0.01.
verbose (bool, optional) – Whether to print the progress, by default True.
test_every (int, optional) – The number of episodes between each test, by default 10.
num_test_episodes (int, optional) – The number of episodes to test, by default 10.
batch_size (int, optional) – The batch size for training, by default 64.
size_replay_buffer (int, optional) – The size of the replay buffer, by default 100_000.
max_grad_norm (int, optional) – The maximum norm of the gradients, by default 10.
normalize_observation (bool, optional) – Whether to normalize the observations, by default False.
use_norm_wrappers (bool, optional) – Whether to use the ClipAction and NormalizeReward wrappers and to clip the observations and rewards to [-10, 10]. This is useful for MuJoCo environments, by default True.
seed (int, optional) – The seed for the random number generator, by default 42.
- Raises:
ValueError – If the action space is not continuous.
NotImplementedError – If the observation space is not 1D, 2D or 3D.
- _update_target_weights(tau=0.01)#
Updates the target weights using the current weights.
It uses the formula:
\[\theta_{target} = \tau \theta_{current} + (1 - \tau) \theta_{target}\]
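In PyTorch terms, this soft update amounts to the following sketch (the networks are assumed to be nn.Module instances with matching parameter ordering):
import torch

@torch.no_grad()
def soft_update(target_net, main_net, tau=0.01):
    """theta_target <- tau * theta_current + (1 - tau) * theta_target."""
    for target_param, param in zip(target_net.parameters(), main_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)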
- _populate_replay_buffer(env)#
Plays random actions in the environment to populate the replay buffer, until the number of samples is equal to learning_starts.
- Parameters:
env (gymnasium.ENV) – The environment to use.
- update_weights()#
Updates the neural networks using the replay buffer.
Proximal Policy Optimization#
- class learning.ppo.PPO(env_kwargs, actor_kwargs={}, critic_kwargs={}, num_envs=10, save_folder='ppo', normalize_observation=False, seed=42, num_steps_per_iter=2048, num_updates_per_iter=10, total_timesteps=1000000, max_episode_length=-1, max_total_reward=-1, test_every=10000, num_test_agents=10, batch_size=64, discount=0.99, use_gae=True, lambda_gae=0.95, policy_loss_clip=0.2, clip_value_loss=True, value_loss_clip=0.2, value_loss_coef=0.5, entropy_loss_coef=0.01, max_grad_norm=0.5, learning_rate=0.0003, lr_annealing=True, norm_advantages=True)#
Implementation of the Proximal Policy Optimization algorithm (Paper).
For this algorithm, the policy network is optimized using the clipped surrogate objective function:
\[L^{CLIP} = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]\]where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio between the new and old policies, and \(\hat{A}_t\) is the advantage function \(A(s, a) = Q(s, a) - V(s)\), optionally estimated with Generalized Advantage Estimation (GAE).
Note that \(V(s)\) is estimated using a critic network, and that \(\epsilon\) controls the magnitude of the policy update.
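A compact PyTorch sketch of the clipped surrogate term, ignoring the value and entropy terms (tensor names are hypothetical):
import torch

def clipped_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(log_probs - old_log_probs)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()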
Example:
from rlib.learning import PPO

env_kwargs = {'id': 'CartPole-v1'}
actor_kwargs = {'hidden_sizes': [64, 64]}
critic_kwargs = {'hidden_sizes': [64, 64]}

ppo = PPO(env_kwargs, actor_kwargs, critic_kwargs,
          batch_size=10,
          num_steps_per_iter=2_000,
          total_timesteps=200_000,  # 10 iterations, 2_000 * num_envs steps per iteration
          num_envs=10,
          seed=42)
ppo.train()
- Variables:
buffer (rlib.learning.rollout_buffer.RolloutBuffer) – Buffer used to store the transitions.
current_agent (rlib.learning.ppo.PPOAgent) – Current agent used to interact with the environment.
- __init__(env_kwargs, actor_kwargs={}, critic_kwargs={}, num_envs=10, save_folder='ppo', normalize_observation=False, seed=42, num_steps_per_iter=2048, num_updates_per_iter=10, total_timesteps=1000000, max_episode_length=-1, max_total_reward=-1, test_every=10000, num_test_agents=10, batch_size=64, discount=0.99, use_gae=True, lambda_gae=0.95, policy_loss_clip=0.2, clip_value_loss=True, value_loss_clip=0.2, value_loss_coef=0.5, entropy_loss_coef=0.01, max_grad_norm=0.5, learning_rate=0.0003, lr_annealing=True, norm_advantages=True)#
Initialize the PPO algorithm.
- Parameters:
env_kwargs (dict) – Keyword arguments to pass to the gym environment.
actor_kwargs (dict) – Keyword arguments to pass to the actor network, as get_agent(kwargs=actor_kwargs).
critic_kwargs (dict) – Keyword arguments to pass to the critic network, as get_agent(kwargs=critic_kwargs, ppo_critic=True).
num_envs (int) – Number of parallel environments. Default is 10.
save_folder (str) – Folder where to save the model. Default is ‘ppo’.
normalize_observation (bool) – Whether to normalize the observations or not. Default is False.
seed (int) – Seed for the random number generator. Default is 42.
num_steps_per_iter (int) – Number of steps per iteration per environment. Default is 2048.
num_updates_per_iter (int) – Number of network updates per iteration. Default is 10.
total_timesteps (int) – Total number of steps to train the agent, should be divisible by num_steps_per_iter x num_envs. Default is 1_000_000.
max_episode_length (int) – Maximum length of an episode. If -1, there is no maximum length. Default is -1.
max_total_reward (int) – Maximum total reward of an episode. If -1, there is no maximum total reward. Default is -1.
test_every (int) – Number of steps between each test. Default is 10_000.
num_test_agents (int) – Number of agents to test. Default is 10.
batch_size (int) – Batch size. Default is 64.
discount (float) – Discount factor. Default is 0.99.
use_gae (bool) – Whether to use Generalized Advantage Estimation (GAE); if not, the advantage estimate is \(A(s, a) = \hat{Q}(s, a) - V(s)\). Default is True (see the sketch after this list).
lambda_gae (float) – Lambda parameter for GAE. Default is 0.95.
policy_loss_clip (float) – Epsilon parameter for the clipped surrogate objective function. Default is 0.2.
clip_value_loss (bool) – Whether to clip the value loss or not. Default is True.
value_loss_clip (float) – Epsilon parameter for the clipped value loss. Default is 0.2.
value_loss_coef (float) – Coefficient for the value loss. Default is 0.5.
entropy_loss_coef (float) – Coefficient for the entropy loss. Default is 0.01.
max_grad_norm (float) – Maximum norm of the gradients. Default is 0.5.
learning_rate (float) – Learning rate. Default is 3e-4.
lr_annealing (bool) – Whether to linearly anneal the learning rate or not. Default is True.
norm_advantages (bool) – Whether to normalize the advantages or not. Default is True.
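The GAE estimator toggled by use_gae and lambda_gae can be sketched as follows for a single environment, with 1-D tensors over a rollout of length T (names are hypothetical; the library's buffer handles the batched, multi-environment case):
import torch

def compute_gae(rewards, values, dones, last_value, discount=0.99, lambda_gae=0.95):
    """Generalized Advantage Estimation over a rollout of length T."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + discount * next_value * not_done - values[t]
        gae = delta + discount * lambda_gae * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages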
- rollout(writer)#
Performs one rollout of the agent in the environment.
The number of stored transitions is equal to num_steps_per_iter x num_envs.
- Parameters:
writer (torch.utils.tensorboard.SummaryWriter) – Tensorboard writer.
- update_agent(writer)#
Updates the agent using data stored in the buffer.
- Parameters:
writer (torch.utils.tensorboard.SummaryWriter) – Tensorboard writer.