SAC (Soft Actor-Critic)

Agent Class

class modularl.agents.SAC(actor: Module, qf1: Module, qf2: Module, actor_optimizer: Optimizer, qf_optimizer: Optimizer, replay_buffer: TensorDictReplayBuffer, gamma: float = 0.99, entropy_lr: float = 0.001, batch_size: int = 32, learning_starts: int = 0, entropy_temperature: float = 0.2, target_entropy: float | None = None, tau: float = 0.005, policy_frequency: int = 1, target_network_frequency: int = 2, device: str = 'cpu', burning_action_func: Callable | None = None, writer: SummaryWriter | None = None, **kwargs: Any)

Bases: AbstractAgent

Soft Actor-Critic (SAC) Agent

Parameters:

actor (torch.nn.Module) – The actor network (policy) to be used.
qf1 (torch.nn.Module) – The first Q-function network.
qf2 (torch.nn.Module) – The second Q-function network.
actor_optimizer (torch.optim.Optimizer) – Optimizer for the actor network.
qf_optimizer (torch.optim.Optimizer) – Optimizer for both Q-function networks.
replay_buffer (TensorDictReplayBuffer) – Replay buffer for storing experiences.
gamma (float, optional) – Discount factor for future rewards. Defaults to 0.99.
entropy_lr (float, optional) – Learning rate for the entropy temperature. Defaults to 1e-3.
batch_size (int, optional) – Number of samples per batch for training. Defaults to 32.
learning_starts (int, optional) – Number of steps before learning starts. Defaults to 0.
entropy_temperature (float, optional) – Initial entropy temperature. Defaults to 0.2.
target_entropy (float, optional) – Target entropy for adaptive temperature adjustment. Defaults to None.
tau (float, optional) – Soft update coefficient for target networks. Defaults to 0.005.
policy_frequency (int, optional) – Frequency of policy updates. Defaults to 1.
target_network_frequency (int, optional) – Frequency of target network updates. Defaults to 2.
device (str, optional) – Device to run the agent on (e.g., “cpu” or “cuda”). Defaults to “cpu”.
burning_action_func (Callable, optional) – Function for generating initial exploratory actions. Defaults to None.
writer (SummaryWriter, optional) – Tensorboard writer for logging. Defaults to None.
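
The roles of tau, policy_frequency and target_network_frequency can be illustrated with a small, library-independent sketch. It assumes the conventional SAC scheme, where target parameters are updated by Polyak averaging and each kind of update is gated by the step count modulo its frequency. This page does not show modularl's internals, so they may differ. The names soft_update and is_due are hypothetical, and plain lists of floats stand in for network parameters.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]


def is_due(step, frequency):
    """Return True when an update gated by `frequency` should run at `step`."""
    return step % frequency == 0


policy_frequency = 1          # default from the signature above
target_network_frequency = 2  # default from the signature above

online = [1.0, 1.0]  # stand-in for Q-network parameters
target = [0.0, 1.0]  # stand-in for target-network parameters
policy_updates = 0

for step in range(1, 5):
    if is_due(step, policy_frequency):
        policy_updates += 1  # the actor would be updated here
    if is_due(step, target_network_frequency):
        # Target parameters move a small fraction (tau) toward the online ones
        target = soft_update(target, online, tau=0.005)
```

With a small tau such as 0.005, the target parameters trail the online parameters slowly, which is the usual reason for preferring soft updates over periodic hard copies.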

init() → None: Initialize the agent.

observe(batch_obs: Tensor, batch_actions: Tensor, batch_rewards: Tensor, batch_next_obs: Tensor, batch_dones: Tensor) → None

Observe the environment and store the transition in the replay buffer.

Parameters:

batch_obs – (torch.Tensor) Tensor containing the observations. Shape: (batch_size, *)
batch_actions – (torch.Tensor) Tensor containing the actions. Shape: (batch_size, action_dim)
batch_rewards – (torch.Tensor) Tensor containing the rewards. Shape: (batch_size,)
batch_next_obs – (torch.Tensor) Tensor containing the next observations. Shape: (batch_size, *)
batch_dones – (torch.Tensor) Tensor containing the dones. Shape: (batch_size,)

where:

batch_size is the number of transitions in the batch
action_dim is the dimension of the action space
* in the observation shape represents the dimension of the observation space, which depends on the environment and the networks (e.g., (3, 64, 64) for images or (512,) for vectors)

Note

The shapes assume a vectorized environment. For a single environment, batch_size would typically be 1.
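
The shape conventions above can be checked before calling observe. The helper below is a hypothetical sketch, not part of modularl: it accepts any objects exposing a .shape attribute (torch tensors do) and raises if the shapes disagree with the table above. SimpleNamespace objects stand in for tensors so that the sketch runs without extra dependencies.

```python
from types import SimpleNamespace


def check_transition_shapes(obs, actions, rewards, next_obs, dones):
    """Validate the observe() shape conventions and return batch_size."""
    batch_size = obs.shape[0]
    if tuple(next_obs.shape) != tuple(obs.shape):
        raise ValueError("batch_next_obs must have the same shape as batch_obs")
    if len(actions.shape) != 2 or actions.shape[0] != batch_size:
        raise ValueError("batch_actions must have shape (batch_size, action_dim)")
    for name, value in (("batch_rewards", rewards), ("batch_dones", dones)):
        if tuple(value.shape) != (batch_size,):
            raise ValueError(f"{name} must have shape (batch_size,)")
    return batch_size


# Stand-ins for tensors from 4 parallel Pendulum-like environments
obs = SimpleNamespace(shape=(4, 3))
actions = SimpleNamespace(shape=(4, 1))
rewards = SimpleNamespace(shape=(4,))
next_obs = SimpleNamespace(shape=(4, 3))
dones = SimpleNamespace(shape=(4,))

batch_size = check_transition_shapes(obs, actions, rewards, next_obs, dones)
```

For a single environment, the same check passes once each tensor has a leading dimension of 1, for example an observation of shape (1, 3).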

act_train(batch_obs: Tensor) → Tensor

Generate actions for training based on the current policy. It uses a burning action function for initial exploration if specified, then switches to the learned policy.

Parameters: batch_obs – (torch.Tensor) A batch of observations from the environment.
Returns: (torch.Tensor) A batch of actions to be taken in the environment.

Notes

If the global step is less than learning_starts and a burning action function is provided, it uses that function for exploration.
Otherwise, it uses the current policy (actor) to generate actions.
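
The selection rule in these notes can be sketched as follows. This is an illustration of the described behaviour, not modularl's code: select_training_action is a hypothetical name, and the assumption that the burning action function receives the observation batch is ours, since the page does not give its signature.

```python
def select_training_action(batch_obs, global_step, learning_starts, policy,
                           burning_action_func=None):
    """Pick the exploration function before learning starts, else the policy."""
    if burning_action_func is not None and global_step < learning_starts:
        # Assumption: the exploration function takes the observation batch
        return burning_action_func(batch_obs)
    return policy(batch_obs)


# String labels stand in for action tensors to make the switch visible
actions = [
    select_training_action(
        [0.0],
        step,
        learning_starts=2,
        policy=lambda obs: "policy",
        burning_action_func=lambda obs: "random",
    )
    for step in range(4)
]
```

When no burning action function is supplied, the policy is used from the first step, even if learning_starts is greater than zero.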

act_eval(batch_obs: Tensor) → Tensor

Select an action for evaluation.

Parameters: batch_obs – (torch.Tensor) A batch of observations from the environment.
Returns: (torch.Tensor) The selected actions for evaluation.

update() → None: Perform a training update.
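
This page does not describe what a training update computes. As background, the sketch below shows the soft Bellman target used for the Q-functions in the standard SAC formulation, written for a single transition with plain floats. It is an assumption that modularl follows this form. The function name soft_q_target is hypothetical, gamma corresponds to the gamma parameter, and alpha plays the role of entropy_temperature.

```python
def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """Standard SAC critic target for one transition.

    y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))
    """
    # Taking the minimum of the two target Q-values curbs overestimation
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value


# Non-terminal transition: the bootstrapped soft value is included
y_continue = soft_q_target(reward=1.0, done=0.0, next_q1=2.0, next_q2=3.0,
                           next_log_prob=-1.0)

# Terminal transition: only the immediate reward remains
y_terminal = soft_q_target(reward=1.0, done=1.0, next_q1=2.0, next_q2=3.0,
                           next_log_prob=-1.0)
```

Both Q-functions are typically regressed toward the same target, which is consistent with a single qf_optimizer covering qf1 and qf2.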

Example Usage

Here’s an example of how to use the SAC agent:

import copy

import gym
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

from modularl.agents import SAC
from modularl.policies import GaussianPolicy
from modularl.q_functions import SAQNetwork
from modularl.replay_buffers import ReplayBuffer

# Create an environment
env = gym.make('Pendulum-v1')
# Get observation and action space dimensions
observation_shape = env.observation_space.shape[0]  # 3 for Pendulum-v1
action_shape = env.action_space.shape[0]  # 1 for Pendulum-v1

# Get action bounds
high_action = env.action_space.high[0]  # 2.0 for Pendulum-v1
low_action = env.action_space.low[0]  # -2.0 for Pendulum-v1

# Optional: Define a custom network
class CustomNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return x

custom_network_actor = CustomNetwork(observation_shape, 64)
# Each Q-function gets its own copy so that no parameters are shared
custom_network_qf1 = copy.deepcopy(custom_network_actor)
custom_network_qf2 = copy.deepcopy(custom_network_actor)

# Initialize the actor and the two Q-function networks
actor = GaussianPolicy(observation_shape, action_shape, high_action, low_action, custom_network_actor)
qf1 = SAQNetwork(observation_shape, action_shape, custom_network_qf1)
qf2 = SAQNetwork(observation_shape, action_shape, custom_network_qf2)

# Initialize the optimizers; a single optimizer covers both Q-functions
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
qf_optimizer = torch.optim.Adam(list(qf1.parameters()) + list(qf2.parameters()), lr=3e-4)

# Initialize the replay buffer
buffer_size = 100000
replay_buffer = ReplayBuffer(buffer_size, sampling="random")

# Optional: Initialize the TensorBoard writer
writer = SummaryWriter()

# Initialize the SAC agent
num_episodes = 10000
learning_starts = 1000
agent = SAC(
    actor=actor,
    qf1=qf1,
    qf2=qf2,
    actor_optimizer=actor_optimizer,
    qf_optimizer=qf_optimizer,
    replay_buffer=replay_buffer,
    learning_starts=learning_starts,  # Steps before training starts (the agent's step counter increases by 1 on every agent.observe() call)
    entropy_lr=3e-4,
    batch_size=64,
    gamma=0.99,
    tau=0.005,
    device="cuda:0",  # Requires a CUDA GPU; use "cpu" otherwise
    target_network_frequency=2,
    target_entropy=-action_shape,
    writer=writer,
)
agent.init()
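# Training loop (assumes the classic Gym API, gym < 0.26, where reset()
# returns the observation and step() returns four values)
for episode in range(num_episodes):
    state = env.reset()
    episode_reward = 0.0
    done = False
    # Add a leading batch dimension: (1, observation_shape)
    state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
    while not done:
        # act_train returns a batch of actions as a tensor
        action = agent.act_train(state)
        next_state, reward, done, _ = env.step(action.squeeze(0).detach().cpu().numpy())

        # Convert to batched torch tensors
        next_state = torch.tensor(next_state, dtype=torch.float32).unsqueeze(0)
        reward_t = torch.tensor([reward], dtype=torch.float32)
        done_t = torch.tensor([done], dtype=torch.float32)

        # Per the signatures above, observe() stores the transition
        # and update() takes no arguments
        agent.observe(state, action, reward_t, next_state, done_t)
        agent.update()

        state = next_state
        episode_reward += float(reward)
    writer.add_scalar('reward', episode_reward, episode)
    print(f"Episode {episode+1}, Reward: {episode_reward}")

# agent.save() is not implemented yet, so save the networks manually.
# The Q-function networks can be saved the same way.
torch.save(actor.state_dict(), 'sac_actor.pth')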

This example demonstrates how to:

Create a Gym environment (Pendulum-v1 in this case)
Extract necessary information from the environment (observation shape, action shape, and action bounds)
Define custom networks for the actor and critic (optional)
Initialize the GaussianPolicy (actor) and SAQNetwork (critic) with appropriate parameters
Set up optimizers for the actor and critic
Create a replay buffer
Initialize the SAC agent with all components and hyperparameters
Run a training loop to interact with the environment and update the agent
Log rewards using TensorBoard