DDPG (Deep Deterministic Policy Gradient)
Agent Class
- class modularl.agents.DDPG(actor: Module, qf: Module, actor_optimizer: Optimizer, qf_optimizer: Optimizer, replay_buffer: TensorDictReplayBuffer, gamma: float = 0.99, batch_size: int = 32, learning_starts: int = 0, tau: float = 0.005, exploration_noise: float = 0.1, policy_frequency: int = 1, device: str = 'cpu', burning_action_func: Callable | None = None, writer: SummaryWriter | None = None, **kwargs: Any)
Bases: AbstractAgent
Deep Deterministic Policy Gradient (DDPG) Agent
- Parameters:
actor (torch.nn.Module) – The actor network.
qf (torch.nn.Module) – The Q-function network.
actor_optimizer (torch.optim.Optimizer) – Optimizer for the actor network.
qf_optimizer (torch.optim.Optimizer) – Optimizer for the Q-function network.
replay_buffer (TensorDictReplayBuffer) – Replay buffer for storing experiences.
gamma (float, optional) – Discount factor for future rewards. Defaults to 0.99.
batch_size (int, optional) – Number of samples per batch for training. Defaults to 32.
learning_starts (int, optional) – Number of steps before learning starts. Defaults to 0.
tau (float, optional) – Soft update coefficient for target networks. Defaults to 0.005.
exploration_noise (float, optional) – Noise added to the actor policy during training. Defaults to 0.1.
policy_frequency (int, optional) – Frequency of delayed policy updates. Defaults to 1.
device (str, optional) – Device to run the agent on (e.g., “cpu” or “cuda”). Defaults to “cpu”.
burning_action_func (Callable, optional) – Function for generating initial exploratory actions. Defaults to None.
writer (SummaryWriter, optional) – Tensorboard writer for logging. Defaults to None.
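The tau parameter is described above as a soft update coefficient for the target networks. The sketch below shows the usual Polyak-averaging form of such an update, written in plain PyTorch. It is an illustration under assumptions, not modularl's internal code: the soft_update helper and the one-layer networks are hypothetical names introduced here.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.mul_(1.0 - tau).add_(s_param, alpha=tau)


# Hypothetical one-layer "online" network and its target copy.
net = nn.Linear(2, 1)
target_net = copy.deepcopy(net)

# Pretend an optimizer step shifted every online parameter by 1.0.
with torch.no_grad():
    for p in net.parameters():
        p.add_(1.0)

soft_update(target_net, net, tau=0.005)

# The target moved only a fraction tau of the way toward the online weights,
# so the remaining gap is (1 - tau) * 1.0.
gap = (net.weight - target_net.weight).abs().max().item()
```

A small tau makes the target networks change slowly, which is the usual reason for preferring a soft update over copying the weights outright.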
- observe(batch_obs: Tensor, batch_actions: Tensor, batch_rewards: Tensor, batch_next_obs: Tensor, batch_dones: Tensor) → None
Observe the environment and store the transition in the replay buffer.
- Parameters:
batch_obs – (torch.Tensor) Tensor containing the observations. Shape: (batch_size, *)
batch_actions – (torch.Tensor) Tensor containing the actions. Shape: (batch_size, action_dim)
batch_rewards – (torch.Tensor) Tensor containing the rewards. Shape: (batch_size,)
batch_next_obs – (torch.Tensor) Tensor containing the next observations. Shape: (batch_size, *)
batch_dones – (torch.Tensor) Tensor containing the dones. Shape: (batch_size,)
- where:
batch_size is the number of transitions in the batch
action_dim is the dimension of the action space
* in the observation shape represents the dimension of the observation space, which depends on the environment and the networks (e.g., (3, 64, 64) for images or (512,) for vectors)
Note
The shapes assume a vectorized environment. For a single environment, batch_size would typically be 1.
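To make the shape conventions above concrete, the following sketch builds placeholder tensors with the documented shapes. The sizes (4 environments, 3-dimensional observations, 1-dimensional actions) are illustrative assumptions, and the tensors are filled with zeros rather than real transitions.

```python
import torch

num_envs = 4     # assumed number of parallel environments (the batch_size above)
obs_dim = 3      # assumed vector observation, e.g. Pendulum-v1
action_dim = 1   # assumed action dimension

batch_obs = torch.zeros(num_envs, obs_dim)          # (batch_size, *)
batch_actions = torch.zeros(num_envs, action_dim)   # (batch_size, action_dim)
batch_rewards = torch.zeros(num_envs)               # (batch_size,)
batch_next_obs = torch.zeros(num_envs, obs_dim)     # (batch_size, *)
batch_dones = torch.zeros(num_envs)                 # (batch_size,)

# For a single environment, add a leading batch dimension of size 1.
single_obs = torch.zeros(obs_dim).unsqueeze(0)      # (1, obs_dim)
single_reward = torch.tensor(-0.5).reshape(1)       # (1,)
```

Tensors shaped this way match the documented arguments of observe; whether the agent also accepts unbatched inputs is not stated in this section.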
- act_train(batch_obs: Tensor) → Tensor
Select an action for training.
- Parameters:
batch_obs – (torch.Tensor) Tensor containing the observations.
- Returns:
(torch.Tensor) Selected actions for training.
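DDPG commonly explores by adding Gaussian noise to the deterministic actor output and clipping the result to the action bounds. The sketch below shows that idea with a hypothetical noisy_action helper. How act_train actually applies exploration_noise (for example, whether it scales the noise by the action range) is not specified in this section, so treat this as an assumption.

```python
import torch


def noisy_action(mean_action: torch.Tensor, exploration_noise: float,
                 low: float, high: float) -> torch.Tensor:
    """Add zero-mean Gaussian noise to the actor output, then clip to bounds."""
    noise = torch.randn_like(mean_action) * exploration_noise
    return torch.clamp(mean_action + noise, min=low, max=high)


torch.manual_seed(0)

# Stand-in actor outputs for three environments with Pendulum-style bounds.
mean_action = torch.tensor([[1.95], [-1.95], [0.0]])
actions = noisy_action(mean_action, exploration_noise=0.1, low=-2.0, high=2.0)
```

Clipping after adding noise keeps every action valid for the environment, even when the actor output already sits near a bound.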
Example Usage
Here’s an example of how to use the DDPG agent:
import copy
import gym
import torch
import torch.nn as nn
from modularl.agents import DDPG
from modularl.policies import DeterministicPolicy
from modularl.q_functions import QNetwork
from modularl.replay_buffers import ReplayBuffer
# Create an environment
env = gym.make('Pendulum-v1')
# Get observation and action space dimensions
observation_shape = env.observation_space.shape[0] # 3 for Pendulum-v1
action_shape = env.action_space.shape[0] # 1 for Pendulum-v1
# Get action bounds
high_action = env.action_space.high[0] # 2.0 for Pendulum-v1
low_action = env.action_space.low[0] # -2.0 for Pendulum-v1
# Optional: Define a custom network
class CustomNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return x
custom_network_actor = CustomNetwork(observation_shape, 64)
custom_network_critic = copy.deepcopy(custom_network_actor)
# Initialize the actor and critic networks
actor = DeterministicPolicy(observation_shape, action_shape, high_action, low_action, custom_network_actor)
critic = QNetwork(observation_shape, action_shape, custom_network_critic)
# Initialize the actor and critic optimizers
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
# Initialize the replay buffer
buffer_size = 100000
replay_buffer = ReplayBuffer(buffer_size)
# Initialize the DDPG agent
agent = DDPG(
    actor=actor,
    qf=critic,
    actor_optimizer=actor_optimizer,
    qf_optimizer=critic_optimizer,  # keyword matches the class signature
    replay_buffer=replay_buffer,
    batch_size=64,
    gamma=0.99,
    tau=0.001,
    exploration_noise=0.1,
    device="cuda:0",
)
agent.init()
# Training loop
num_episodes = 1000
# Uses the classic Gym API (gym < 0.26): reset() returns the observation
# and step() returns a 4-tuple.
for episode in range(num_episodes):
    state = env.reset()
    episode_reward = 0.0
    done = False
    state = torch.tensor(state, dtype=torch.float32)
    while not done:
        action = agent.act_train(state)
        # env.step expects a NumPy array, not a torch tensor
        next_state, reward, done, _ = env.step(action.detach().cpu().numpy())
        episode_reward += reward  # accumulate the plain Python float
        # Convert to torch tensors
        next_state = torch.tensor(next_state, dtype=torch.float32)
        reward = torch.tensor(reward, dtype=torch.float32)
        done_tensor = torch.tensor(done, dtype=torch.float32)
        # Observe and update the agent
        agent.observe(state, action, reward, next_state, done_tensor)
        agent.update()
        state = next_state
    print(f"Episode {episode+1}, Reward: {episode_reward}")
# Save the trained agent
agent.save('ddpg_agent.pth')  # Not implemented yet; consider saving the actor and critic models manually
This example demonstrates how to:
Create a Gym environment (Pendulum-v1 in this case)
Extract necessary information from the environment (observation shape, action shape, and action bounds)
Define custom networks for the actor and critic (optional)
Initialize the DeterministicPolicy (actor) and QNetwork (critic) with appropriate parameters
Set up optimizers for the actor and critic
Create a replay buffer
Initialize the DDPG agent with all components and hyperparameters
Run a training loop to interact with the environment and update the agent
Log rewards using print statements (TensorBoard logging can be added similarly)
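The last line of the example notes that agent.save is not implemented yet and suggests saving the actor and critic manually. One way to do that, sketched here with standard torch.save and load_state_dict calls, is to store both state dicts in a single checkpoint file. The nn.Linear modules are stand-ins for the real actor and critic, and the file name is an arbitrary choice.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-ins for the trained actor and critic (any nn.Module works the same way).
actor = nn.Linear(3, 1)
critic = nn.Linear(4, 1)

# Save both state dicts in one checkpoint file.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "ddpg_checkpoint.pth")
torch.save(
    {"actor": actor.state_dict(), "critic": critic.state_dict()},
    checkpoint_path,
)

# Later: rebuild modules with the same architecture and restore the weights.
restored_actor = nn.Linear(3, 1)
restored_critic = nn.Linear(4, 1)
checkpoint = torch.load(checkpoint_path)
restored_actor.load_state_dict(checkpoint["actor"])
restored_critic.load_state_dict(checkpoint["critic"])
```

Saving state dicts rather than whole module objects keeps the checkpoint independent of the class definitions' import paths. Optimizer state can be stored under additional keys in the same dictionary if training needs to be resumed.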
The main differences between DDPG and TD3 in this example are:
DDPG uses a single critic (QNetwork) instead of two critics
The hyperparameters are slightly different (e.g., tau, learning rates)
DDPG doesn’t take the TD3-specific policy_noise and noise_clip parameters; policy_frequency is still accepted (see the class signature) and defaults to 1
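The single-critic versus two-critic difference shows up most directly in the bootstrapped target. The sketch below contrasts the two target computations on hand-picked numbers. The function names are hypothetical, and TD3's target-action smoothing (the clipped noise controlled by policy_noise and noise_clip) is left out for brevity.

```python
import torch


def ddpg_target(reward, done, q_next, gamma=0.99):
    """DDPG: bootstrap from the single target critic."""
    return reward + gamma * (1.0 - done) * q_next


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD3: bootstrap from the element-wise minimum of two target critics."""
    return reward + gamma * (1.0 - done) * torch.min(q1_next, q2_next)


reward = torch.tensor([1.0, 1.0])
done = torch.tensor([0.0, 0.0])
q1_next = torch.tensor([10.0, 5.0])   # illustrative target-critic estimates
q2_next = torch.tensor([8.0, 7.0])

y_ddpg = ddpg_target(reward, done, q1_next)          # uses q1 only
y_td3 = td3_target(reward, done, q1_next, q2_next)   # uses min(q1, q2)
```

Taking the minimum can only lower the target relative to either critic alone, which is how TD3 counteracts the overestimation that a single critic can accumulate.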