Skip to content

Agent base

agent

Common abstract base class for trainable agents in rlib.

Almost every agent in :mod:rlib reimplements the same boilerplate:

  • store the same set of LR / grad-clip hyperparameters,
  • build a polynomial-decay LR scheduler on top of an optimiser, and
  • run an identical loss.backward → clip_grad_norm_ → optimiser.step → zero_grad → scheduler.step sequence at the end of every backprop call.

:class:Agent factors that out into a single abstract base so agent implementations can focus on what is actually agent-specific (their forward pass, evaluate signature, loss formulation, and backprop signature).

The base class deliberately does not know about any specific RL algorithm. Algorithm-specific loss functions live on per-algorithm subclasses (e.g. :class:rlib.A2C.A2CModel) which their concrete variants (feed-forward, recurrent, ...) can inherit and reuse.

Subclasses are expected to:

  1. Call super().__init__(config=...) with a :class:ModelConfig (or subclass).
  2. Build their network heads (policy/value/Q/etc.) attached to self.device.
  3. Optionally call :meth:Agent._build_optimiser once they have all their parameters in place. Composite agents that delegate optimisation to a child can simply skip this step.
  4. Implement forward (inherited from :class:torch.nn.Module) and the abstract :meth:backprop and :meth:evaluate methods.
  5. End each backprop method with return self._train_step(loss).

Per-agent ModelConfig subclasses (A2CConfig, PPOConfig, ...) live next to their respective agent class in rlib/<Agent>/model.py.

ModelConfig dataclass

ModelConfig(lr: float = 0.001, lr_final: float = 0.0, decay_steps: int = 600000, grad_clip: float | None = 0.5, device: str = 'cuda')

Shared hyperparameters for every :class:Agent.

Attributes:

Name Type Description
lr float

Initial learning rate.

lr_final float

Final learning rate the polynomial scheduler decays to.

decay_steps int

Optimiser steps over which the LR decays from lr to lr_final. Use lr_final == lr for a constant LR.

grad_clip float | None

Maximum gradient norm; None disables clipping.

device str

Torch device string ("cuda", "cpu", ...).

Agent

Agent(config: ModelConfig)

Bases: Module, ABC

Abstract base class for trainable rlib agent models.

Parameters:

Name Type Description Default
config ModelConfig

A :class:ModelConfig (or subclass) holding LR / scheduler / device hyperparameters.

required

Attributes:

Name Type Description
config ModelConfig

The original config object (immutable).

lr, (lr_final, decay_steps, grad_clip, device)

Convenience mirrors of the corresponding config fields.

Source code in rlib/agent.py
90
91
92
93
94
95
96
97
98
99
def __init__(self, config: ModelConfig) -> None:
    super().__init__()
    self.config = config
    # Mirror the most-frequently-read fields onto ``self`` for
    # ergonomics — call sites read ``self.lr`` etc. directly.
    self.lr = config.lr
    self.lr_final = config.lr_final
    self.decay_steps = config.decay_steps
    self.grad_clip = config.grad_clip
    self.device = config.device

evaluate abstractmethod

evaluate(*args: Any, **kwargs: Any) -> Any

Numpy-in / numpy-out inference call used by agent rollouts.

The exact signature is agent-specific (feed-forward agents take a single observation array, recurrent agents also take a hidden-state tuple, etc.), so each concrete subclass declares its own. Implementations should run :meth:forward under torch.no_grad() and convert torch tensors back to numpy.

Source code in rlib/agent.py
108
109
110
111
112
113
114
115
116
117
@abstractmethod
def evaluate(self, *args: Any, **kwargs: Any) -> Any:
    """Numpy-in / numpy-out inference call used by agent rollouts.

    The exact signature is agent-specific (feed-forward agents take
    a single observation array, recurrent agents also take a
    hidden-state tuple, etc.), so each concrete subclass declares
    its own.  Implementations should run :meth:`forward` under
    ``torch.no_grad()`` and convert torch tensors back to numpy.
    """

backprop abstractmethod

backprop(*args: Any, **kwargs: Any) -> np.ndarray

Run a full training step (numpy in / numpy loss-scalar out).

Implementations typically: convert numpy inputs to tensors, call :meth:forward, compute the agent-specific loss, then return self._train_step(loss).

Source code in rlib/agent.py
119
120
121
122
123
124
125
126
@abstractmethod
def backprop(self, *args: Any, **kwargs: Any) -> np.ndarray:
    """Run a full training step (numpy in / numpy loss-scalar out).

    Implementations typically: convert numpy inputs to tensors,
    call :meth:`forward`, compute the agent-specific loss, then
    return ``self._train_step(loss)``.
    """

value_loss staticmethod

value_loss(R: Tensor, V: Tensor) -> torch.Tensor

Half mean-squared-error between targets R and values V.

Used by every value-based agent in this library (A2C critic, PPO critic, DQN, Q-aux, ...). Algorithm-specific losses (A2C actor-critic loss, PPO clipped objective, ...) belong on the relevant per-algorithm :class:Model subclass.

Source code in rlib/agent.py
153
154
155
156
157
158
159
160
161
162
@staticmethod
def value_loss(R: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Half mean-squared-error between targets ``R`` and values ``V``.

    Used by every value-based agent in this library (A2C critic,
    PPO critic, DQN, Q-aux, ...). Algorithm-specific losses (A2C
    actor-critic loss, PPO clipped objective, ...) belong on the
    relevant per-algorithm :class:`Model` subclass.
    """
    return 0.5 * torch.mean(torch.square(R - V))