TD3 (Twin Delayed DDPG)

Agent Class

class modularl.agents.TD3(actor: Module, qf1: Module, qf2: Module, actor_optimizer: Optimizer, qf_optimizer: Optimizer, replay_buffer: TensorDictReplayBuffer, gamma: float = 0.99, batch_size: int = 32, learning_starts: int = 0, tau: float = 0.005, exploration_noise: float = 0.1, policy_noise: float = 0.2, noise_clip: float = 0.5, policy_frequency: int = 2, device: str = 'cpu', burning_action_func: Callable | None = None, writer: SummaryWriter | None = None, **kwargs: Any)

Bases: AbstractAgent

Twin Delayed Deep Deterministic Policy Gradient (TD3) Agent

Parameters:

actor (torch.nn.Module) – The actor network.
qf1 (torch.nn.Module) – The first Q-function network.
qf2 (torch.nn.Module) – The second Q-function network.
actor_optimizer (torch.optim.Optimizer) – Optimizer for the actor network.
qf_optimizer (torch.optim.Optimizer) – Optimizer for the Q-function networks.
replay_buffer (TensorDictReplayBuffer) – Replay buffer for storing experiences.
gamma (float, optional) – Discount factor for future rewards. Defaults to 0.99.
batch_size (int, optional) – Number of samples per batch for training. Defaults to 32.
learning_starts (int, optional) – Number of steps before learning starts. Defaults to 0.
tau (float, optional) – Soft update coefficient for target networks. Defaults to 0.005.
exploration_noise (float, optional) – Noise added to the actor policy during training. Defaults to 0.1.
policy_noise (float, optional) – Noise added to the target policy during critic updates. Defaults to 0.2.
noise_clip (float, optional) – Range to clip the target policy noise. Defaults to 0.5.
policy_frequency (int, optional) – Frequency of delayed policy updates. Defaults to 2.
device (str, optional) – Device to run the agent on (e.g., “cpu” or “cuda”). Defaults to “cpu”.
burning_action_func (Callable, optional) – Function for generating initial exploratory actions. Defaults to None.
writer (SummaryWriter, optional) – TensorBoard writer for logging. Defaults to None.
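
To make the roles of gamma, policy_noise, and noise_clip concrete, the following is a minimal sketch of the critic target used by the TD3 algorithm in general. It is not taken from modularl's source: the function name td3_target, its arguments, and the scalar action bounds low and high are assumptions made for illustration, and the critics are assumed to be callables of the form qf(obs, action).

```python
import torch


def td3_target(target_actor, target_qf1, target_qf2, next_obs, rewards, dones,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5,
               low=-1.0, high=1.0):
    """Clipped double-Q target with target policy smoothing (illustrative only)."""
    with torch.no_grad():
        next_actions = target_actor(next_obs)
        # Target policy smoothing: perturb the target action with clipped Gaussian noise.
        noise = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(low, high)
        # Clipped double-Q: use the smaller of the two target critics' estimates.
        q1 = target_qf1(next_obs, next_actions).squeeze(-1)
        q2 = target_qf2(next_obs, next_actions).squeeze(-1)
        min_q = torch.min(q1, q2)
        # Terminal transitions (dones == 1) do not bootstrap from the next state.
        return rewards + gamma * (1.0 - dones) * min_q
```

Taking the minimum of the two critics counteracts Q-value overestimation, and the clipped noise keeps the target from exploiting sharp peaks in the critic.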

init() → None

Initialize the agent.

observe(batch_obs: Tensor, batch_actions: Tensor, batch_rewards: Tensor, batch_next_obs: Tensor, batch_dones: Tensor) → None

Observe the environment and store the transition in the replay buffer.

Parameters:

batch_obs – (torch.Tensor) Tensor containing the observations. Shape: (batch_size, *)
batch_actions – (torch.Tensor) Tensor containing the actions. Shape: (batch_size, action_dim)
batch_rewards – (torch.Tensor) Tensor containing the rewards. Shape: (batch_size,)
batch_next_obs – (torch.Tensor) Tensor containing the next observations. Shape: (batch_size, *)
batch_dones – (torch.Tensor) Tensor containing the dones. Shape: (batch_size,)

where:

batch_size is the number of transitions in the batch
action_dim is the dimension of the action space
* in the observation shape represents the dimension of the observation space, which depends on the environment and the networks (e.g., (3, 64, 64) for images or (512,) for vectors)

Note

The shapes assume a vectorized environment. For a single environment, batch_size would typically be 1.
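
As an illustration of these shapes, the sketch below converts one transition from a single (non-vectorized) environment into batched tensors with batch_size 1. The helper name to_batch is hypothetical and not part of modularl.

```python
import torch


def to_batch(obs, action, reward, next_obs, done):
    """Add a leading batch dimension of 1 to a single transition."""
    batch_obs = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)            # (1, *)
    batch_actions = torch.as_tensor(action, dtype=torch.float32).reshape(1, -1)   # (1, action_dim)
    batch_rewards = torch.as_tensor([reward], dtype=torch.float32)                # (1,)
    batch_next_obs = torch.as_tensor(next_obs, dtype=torch.float32).unsqueeze(0)  # (1, *)
    batch_dones = torch.as_tensor([done], dtype=torch.float32)                    # (1,)
    return batch_obs, batch_actions, batch_rewards, batch_next_obs, batch_dones


# Pendulum-like transition: 3-dimensional observation, 1-dimensional action.
batch = to_batch([0.0, 0.0, 0.0], [0.5], -1.2, [1.0, 1.0, 1.0], False)
print([tuple(t.shape) for t in batch])  # prints [(1, 3), (1, 1), (1,), (1, 3), (1,)]
```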

act_train(batch_obs: Tensor) → Tensor

Select an action for training.

Parameters: batch_obs – (torch.Tensor) Tensor containing the observations.
Returns: (torch.Tensor) Selected actions for training.
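
One common way to select training actions in TD3 is to add Gaussian noise, scaled by exploration_noise, to the deterministic action and clip the result to the action bounds. The sketch below shows that scheme. Whether modularl's act_train does exactly this is an assumption, and noisy_action, low, and high are hypothetical names introduced for illustration.

```python
import torch


def noisy_action(actor, batch_obs, exploration_noise=0.1, low=-2.0, high=2.0):
    """Deterministic action plus Gaussian exploration noise, clipped to the bounds."""
    with torch.no_grad():
        actions = actor(batch_obs)
        # Scale the noise by the largest action magnitude so that
        # exploration_noise is relative to the action range.
        scale = max(abs(low), abs(high))
        noise = torch.randn_like(actions) * exploration_noise * scale
        return (actions + noise).clamp(low, high)
```

With exploration_noise set to 0 this reduces to the deterministic action, which is the behaviour one would expect at evaluation time.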

act_eval(batch_obs: Tensor) → Tensor

Select an action for evaluation.

Parameters: batch_obs – (torch.Tensor) Tensor containing the observations.
Returns: (torch.Tensor) Selected actions for evaluation.

update() → None

Perform a training update.
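
In TD3 generally, an update step trains the critics on every call, while the actor and the target networks are updated only every policy_frequency calls, with the targets tracking the online networks through a soft (Polyak) update controlled by tau. A minimal sketch of those two mechanisms follows. The helpers soft_update and should_update_policy are hypothetical and not modularl functions, and the nn.Linear modules are stand-ins for real networks.

```python
import copy

import torch
import torch.nn as nn


def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """In-place Polyak averaging: target <- tau * source + (1 - tau) * target."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(s_param, alpha=tau)


def should_update_policy(step: int, policy_frequency: int = 2) -> bool:
    """True on every policy_frequency-th update step (delayed policy update)."""
    return step % policy_frequency == 0


net = nn.Linear(3, 1)
target_net = copy.deepcopy(net)
for step in range(1, 5):
    # ... the critic update would run here on every step ...
    if should_update_policy(step):
        # ... the actor update would run here ...
        soft_update(target_net, net, tau=0.005)
```

A small tau makes the targets change slowly, which stabilises the bootstrapped critic target.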

Example Usage

Here’s an example of how to use the TD3 agent:

import copy

import gym
import torch
import torch.nn as nn
from modularl.agents import TD3
from modularl.policies import DeterministicPolicy
from modularl.q_functions import SAQNetwork
from modularl.replay_buffers import ReplayBuffer

# Create an environment
env = gym.make('Pendulum-v1')

# Get observation and action space dimensions
observation_shape = env.observation_space.shape[0]  # 3 for Pendulum-v1
action_shape = env.action_space.shape[0]  # 1 for Pendulum-v1

# Get action bounds
high_action = env.action_space.high[0]  # 2.0 for Pendulum-v1
low_action = env.action_space.low[0]  # -2.0 for Pendulum-v1

# Optional: Define a custom network
class CustomNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return x

custom_network_actor = CustomNetwork(observation_shape, 64)
custom_network_critic = copy.deepcopy(custom_network_actor)

# Initialize the actor and critic networks.
# Each critic gets its own copy of the backbone so that the twin
# Q-functions do not share parameters.
actor = DeterministicPolicy(observation_shape, action_shape, high_action, low_action, custom_network_actor)
critic1 = SAQNetwork(observation_shape, action_shape, custom_network_critic)
critic2 = SAQNetwork(observation_shape, action_shape, copy.deepcopy(custom_network_critic))

# Initialize the actor and critics optimizers
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_optimizer = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)

# Initialize the replay buffer
buffer_size = 100000
replay_buffer = ReplayBuffer(buffer_size)

# Initialize the TD3 agent
agent = TD3(
    actor=actor,
    qf1=critic1,
    qf2=critic2,
    actor_optimizer=actor_optimizer,
    qf_optimizer=critic_optimizer,
    replay_buffer=replay_buffer,
    batch_size=64,
    gamma=0.99,
    tau=0.005,
    exploration_noise=0.1,
    policy_noise=0.2,
    noise_clip=0.5,
    policy_frequency=2,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)
agent.init()

# Training loop
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()  # gym < 0.26 API: reset() returns only the observation
    episode_reward = 0.0
    done = False
    # Add a batch dimension of 1 for the single environment
    state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
    while not done:
        action = agent.act_train(state)
        # env.step expects a NumPy action without the batch dimension
        next_state, reward, done, _ = env.step(action.detach().cpu().numpy()[0])
        # Convert to batched torch tensors
        next_state = torch.tensor(next_state, dtype=torch.float32).unsqueeze(0)
        reward_t = torch.tensor([reward], dtype=torch.float32)
        done_t = torch.tensor([done], dtype=torch.float32)

        # Observe and update the agent
        agent.observe(state, action, reward_t, next_state, done_t)
        agent.update()

        state = next_state
        episode_reward += float(reward)
    print(f"Episode {episode+1}, Reward: {episode_reward}")

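# Save the trained agent
# agent.save() is not implemented yet, so save the networks manually
torch.save(actor.state_dict(), 'td3_actor.pth')
torch.save(critic1.state_dict(), 'td3_critic1.pth')
torch.save(critic2.state_dict(), 'td3_critic2.pth')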

This example demonstrates how to:

Create a Gym environment (Pendulum-v1 in this case)
Extract necessary information from the environment (observation shape, action shape, and action bounds)
Define custom networks for the actor and critic (optional)
Initialize the DeterministicPolicy (actor) and SAQNetwork (critic) with appropriate parameters
Set up optimizers for the actor and critic
Create a replay buffer
Initialize the TD3 agent with all components and hyperparameters
Run a training loop to interact with the environment and update the agent
Log episode rewards with print statements (for TensorBoard logging, pass a SummaryWriter through the writer parameter)
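
Because agent.save is marked as not implemented, one option is to persist and restore the networks with PyTorch's own state_dict mechanism. The sketch below is a minimal round trip under that assumption. The helper names, the checkpoint keys, and the file name are arbitrary choices, and the nn.Linear modules are stand-ins for the actor and critics.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_networks(path, actor, qf1, qf2):
    """Store the parameters of all three networks in a single checkpoint file."""
    torch.save(
        {"actor": actor.state_dict(), "qf1": qf1.state_dict(), "qf2": qf2.state_dict()},
        path,
    )


def load_networks(path, actor, qf1, qf2, device="cpu"):
    """Restore parameters into already-constructed networks of matching shape."""
    checkpoint = torch.load(path, map_location=device)
    actor.load_state_dict(checkpoint["actor"])
    qf1.load_state_dict(checkpoint["qf1"])
    qf2.load_state_dict(checkpoint["qf2"])


# Stand-in networks; in practice these would be the actor and the two critics.
actor, qf1, qf2 = nn.Linear(3, 1), nn.Linear(4, 1), nn.Linear(4, 1)
path = os.path.join(tempfile.mkdtemp(), "td3_networks.pth")
save_networks(path, actor, qf1, qf2)
load_networks(path, actor, qf1, qf2)
```

Optimizer state can be stored the same way through optimizer.state_dict() if training needs to be resumed rather than only evaluated.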