Composing Reinforcement Learning Policies, with Formal Guarantees
or obtaining formal guarantees by synthesizing two-level controllers from verified reinforcement learning policies
This blog post is based on our recent AAMAS paper, Composing Reinforcement Learning Policies, with Formal Guarantees.
In this blog post, I will explain how to exploit learned world models to produce agent controllers from carefully designed reinforcement learning policies, with formal guarantees. Let’s start with a bit of background and motivation.
Motivating Example
Consider an agent that needs to deliver a package in a warehouse amid moving obstacles.
Such obstacles may consist of humans, forklifts, or other robots.
In such a setting, it is critical to guarantee the agent avoids the moving obstacles and delivers its package as fast as possible, both for supply chain reliability and safety reasons.
One option is to use model-based techniques: build a model of the environment by listing all possible agent actions in every environment state as well as their subsequent transitions. Then, planning algorithms can be used to find the best policy. However, creating such a model is challenging: one has to consider all possible interactions, including those with uncertain moving obstacles. Plus, as the warehouse gets larger, the model becomes huge, and standard planning methods might struggle to scale.
That leaves learning algorithms, which would allow the agent to learn, through interaction, how to behave optimally in the environment. However, it is unclear how to provide guarantees with machine learning algorithms, especially on the way the agent behaves in its environment.
As a solution, we consider combining two techniques with complementary benefits and drawbacks, namely reinforcement learning and reactive synthesis, which together enable learning policies with formal guarantees.
Controller design in complex environments
Reinforcement Learning (RL) is a machine learning paradigm that allows autonomous agents that interact with an environment to learn a control policy.
The idea behind RL is simple: at each step, the agent performs an action, and it receives in response a reward along with the observation of the next environment’s state.
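To make this interaction loop concrete, here is a minimal, generic sketch using the Gymnasium API (the environment and the random action choice are placeholders for illustration, not the setting studied in the paper):

```python
import gymnasium as gym

# Generic agent-environment interaction loop.
env = gym.make("CartPole-v1")            # any simulated environment exposing this interface
observation, info = env.reset(seed=0)

for step in range(1_000):
    action = env.action_space.sample()   # placeholder for the learned policy pi(a | s)
    observation, reward, terminated, truncated, info = env.step(action)
    # An RL algorithm would update the policy here from (state, action, reward, next state).
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```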
In critical applications such as epidemic control, robotics, multi-energy systems, controlling tokamak plasmas for nuclear fusion, and controlling stratospheric balloons, RL has shown great promise in enabling agents to learn efficient policies. However, deploying RL in complex, real-world systems often hits two major challenges:
- Designing a reward function that aligns the user’s intentions with the agent’s actual behavior constitutes a challenging engineering task (referred to as reward engineering).
- In general, RL lacks formal guarantees on the induced agent behavior, which raises reliability concerns, especially in high-stakes scenarios.
On the other hand, when a model of the environment is available, one can apply a formal verification technique called reactive synthesis to produce controller policies with formal guarantees.
However, as mentioned, reactive synthesis relies on a user-provided model of the environment:
- this model is in fact rarely explicitly accessible (that is, in the form of a state-transition system or Markov model);
- even when it is, since model checking exhaustively explores the model (often assumed finite), it struggles to scale to large domains and even becomes intractable in complex settings.
The complementary benefits and drawbacks of RL and reactive synthesis are summarized in the following table.
| Aspect | Reactive Synthesis | Reinforcement Learning |
| --- | --- | --- |
| Model Dependency | Requires prior environment model | No prior model needed (simulating the environment is sufficient) |
| Guarantees | Formal guarantees | No formal guarantees |
| Specification | Logical formulas (intuitive, high-level) | Reward functions (engineering-intensive) |
| Scalability | Limited by the model size | Scalable to high-dimensional feature spaces (deep RL) |
Our goal is to combine the two techniques to unify their advantages while mitigating their drawbacks.
The Approach
To recap, we would like to construct controller policies, with formal guarantees, for agents interacting with very large environments, by combining RL and synthesis techniques. Synthesis requires some knowledge of the environment, so we decompose the complex environment into two levels:
- A top level consisting of a map, provided as a graph, describing the known high-level structure of the environment. Here, we assume the user has some basic information about the environment, which is relatively easy to obtain.
- A bottom level consisting of different “sections” that we call rooms. Concretely, each vertex of the high-level graph is populated by a Markov decision process (MDP, for short) whose dynamics are unknown, i.e., the user doesn’t need to explicitly model each room.
Schematic view of a two-level model. Bottom: the environment; Top: the map; Middle: unknown environment dynamics, modelled as MDPs stitched together, one for each room.
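As a rough illustration of this decomposition, here is how such a two-level model could be represented in code; the class and field names are hypothetical (not the paper’s API), and each room’s dynamics remain a black-box simulator:

```python
from dataclasses import dataclass, field

@dataclass
class Room:
    """Bottom level: a section of the environment whose dynamics are unknown,
    accessible only through a black-box (e.g., Gymnasium-style) simulator."""
    name: str
    simulator: object                              # unknown MDP, only simulable
    exits: list = field(default_factory=list)      # available exit directions

@dataclass
class TwoLevelModel:
    """Top level: the map, a directed graph whose vertices are rooms."""
    rooms: dict      # vertex name -> Room
    edges: set       # (source room, exit direction, target room) triples

# Toy example: a warehouse with two rooms connected by an eastward corridor.
warehouse = TwoLevelModel(
    rooms={
        "storage": Room("storage", simulator=None, exits=["east"]),
        "loading_dock": Room("loading_dock", simulator=None),
    },
    edges={("storage", "east", "loading_dock")},
)
```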
From now on, we focus on reach-avoid properties as they are the building blocks for verifying more complex specifications (you may refer to this paper if you’re interested in further details).
Back to our warehouse example, modeling all possible interactions the agent can have with its surroundings is complicated, but providing a map of the building is straightforward.
First, we apply RL in parallel in each room, and for every possible exit direction, to obtain a set of low-level policies.
Then, given the map, the set of low-level latent models and policies, and a high-level specification (here, reaching the target $\color{darkgreen} t$), we apply synthesis to obtain a high-level planner that selects which low-level policy to apply in each room.
The Formal Guarantees
As I mentioned above, the key advantages of our approach consist of
- the separation of concerns (reward engineering is only applied locally, in the rooms)
- the formal guarantees associated with the synthesized, two-level controllers.
Specifically, our formal guarantees are the following.
1. Learning guarantees for low-level policies: the latent models of the rooms are guaranteed to be accurate abstractions. The quality of the abstraction can further be verified and quantified through a score, formalized as probably approximately correct (PAC) bounds on the loss function used to train the latent models.
2. Guarantees on the synthesized two-level controller: a small amount of memory for the high-level planner is necessary and sufficient for optimal control with respect to the policies learned via RL. Furthermore, such an optimal planner can be obtained by solving an MDP obtained by stitching the latent models together.
3. Unified learning and synthesis guarantees: the learning guarantees for the low-level policies (1) can be lifted to the two-level controller.
To elaborate further on those guarantees, we need to go into a bit more detail about what happens when the agent learns its low-level policies. Before describing how low-level policies are trained, let me formally introduce the concept of value function. This is a really important concept, as it is what our formal guarantees rely on.
Value functions for discounted properties
Intuitively, given a reach-avoid specification $\varphi$, the value of a policy is defined as $$ V^{\mathcal{M}}_{\pi}\left(\varphi \right) := \mathbb{E}\left[ \sup_{i \geq 0} \gamma^i \cdot \mathbb{1}\left( s_i \in {\color{darkgreen}T} \, \wedge \, \forall j \leq i \colon s_j \not\in {\color{red}B} \right) \right], $$ where
- $\mathcal{M}$ is an MDP;
- $\pi$ is a policy;
- $\gamma \in \mathopen(0, 1\mathclose)$ is a discount factor;
- $\color{darkgreen}T$ and $\color{red}B$ are respectively the set of states to reach and avoid in $\varphi$;
- the expectation is taken over the agent’s trajectories $\left(s_t\right)_{t \geq 0}$ where $s_t$ (probabilistically) transitions to $s_{t + 1}$ when the agent plays $a_t \sim \pi\left(\cdot \mid s_t\right)$ in $\mathcal{M}$. Here, $s_i$ and $s_j$ correspond to the $i^{\text{th}}$ and $j^{\text{th}}$ states that occur along such trajectories.
Let’s take a closer look at this equation. Essentially, a “payoff” of $\gamma^i$ is assigned to each trajectory that safely reaches the target in $i$ steps, where $i$ is the first time the target is visited along the trajectory. The value is the resulting expected payoff. When $\gamma \to 1$, it boils down to the probability of satisfying the specification $\varphi$. So why not optimize this probability directly? There are several reasons to optimize the value function instead:
- for computational reasons: discounting stabilizes learning and helps handle infinite-horizon properties, which would otherwise hamper learnability;
- for its compatibility with standard RL algorithms, which typically optimize discounted returns;
- for the semantics of the discount: it implicitly adds a timing aspect to the specification, prioritizing faster satisfaction (e.g., with $\gamma = 0.99$, safely reaching the target in 10 steps yields a payoff of $0.99^{10} \approx 0.90$, versus $\approx 0.37$ in 100 steps).
But most importantly, discounted reach-avoid values tie to bisimulation metrics, ensuring that the latent models abstract the rooms with high fidelity. We detail these guarantees below.
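To make the definition concrete, the value of a policy can be estimated from rollouts by straightforward Monte Carlo simulation. The sketch below assumes a Gymnasium-style simulator and membership tests for the target set $\color{darkgreen}T$ and the bad set $\color{red}B$; the function and argument names are mine, not the paper’s:

```python
def estimate_reach_avoid_value(env, policy, in_target, in_bad,
                               gamma=0.99, episodes=1000, horizon=500):
    """Monte Carlo estimate of E[ gamma^i ], where i is the first time the target set T
    is reached without having visited the bad set B at any step up to (and including) i."""
    total = 0.0
    for _ in range(episodes):
        state, _ = env.reset()
        payoff = 0.0
        for i in range(horizon):                 # finite-horizon truncation of the trajectory
            if in_bad(state):                    # specification violated: payoff stays 0
                break
            if in_target(state):                 # target reached safely at step i
                payoff = gamma ** i
                break
            action = policy(state)
            state, _, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
        total += payoff
    return total / episodes
```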
Low-level RL
Concretely, the low-level RL procedure consists of the following steps:
1. A low-level policy $\pi_{v, e}$ is learned via RL for every room $v$ and exit direction $e$ (outgoing edge of $v$ in the graph).
2. $\pi_{v, e}$ is optimized towards satisfying the low-level specification $\varphi :=$ “reach the exit of room $v$ in direction $e$ while avoiding the moving obstacles.” For this objective, reward engineering may be applied locally (see the sketch after this list). If tuning a reward signal for this objective proves too difficult, one may consider dividing the room into simpler, smaller (sub)rooms.
3. An abstract latent model $\overline{\mathcal{M}}$ is learned for each room. As I will detail below, learning accurate abstractions of the rooms can be achieved by minimizing a carefully designed auxiliary loss alongside any RL objective. In our case, we combine WAE-MDPs with Rainbow DQN.
4. The value of the policy, $V^{\overline{\mathcal{M}}}_{\pi_{v, e}}$, is **formally verified** in the latent model $\overline{\mathcal{M}}$.

All low-level policies/models are learned in parallel.
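As a sketch of what local reward engineering for this low-level specification might look like, here is an illustrative reward function for a grid-like room (the shaping terms, constants, and cell-set arguments are assumptions for illustration, not the rewards used in the paper):

```python
def low_level_reward(next_state, exit_cells, obstacle_cells):
    """Illustrative shaped reward for the low-level specification
    'reach the exit of the room in direction e while avoiding moving obstacles'."""
    if next_state in obstacle_cells:
        return -1.0   # collided with a moving obstacle
    if next_state in exit_cells:
        return +1.0   # reached the exit cells in the chosen direction
    return -0.01      # small step penalty: encourages reaching the exit quickly
```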
For each room $v$ and direction $e$, the latent model $\overline{\mathcal{M}}$ is learned by minimizing the loss $L_{\mathbf{P}}^{v, e}$, defined as the total variation (TV) distance between the transition dynamics of room $v$ and those of the latent model $\overline{\mathcal{M}}$.
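The paper’s exact formulation involves a state-embedding function and the distribution of states visited under the low-level policy; purely as an illustrative sketch (the symbols $\phi$, $\overline{\mathbf{P}}$, and $\xi_{\pi_{v, e}}$ below are my own shorthand, not necessarily the paper’s notation), the loss has the following flavor: $$ L_{\mathbf{P}}^{v, e} \;\approx\; \mathbb{E}_{s, a \sim \xi_{\pi_{v, e}}} \, d_{TV}\!\left( \phi\,\mathbf{P}\left(\cdot \mid s, a\right),\; \overline{\mathbf{P}}\left(\cdot \mid \phi(s), a\right) \right), $$ where $\mathbf{P}$ is the (unknown) transition function of room $v$, $\phi$ maps the room’s states to latent states, $\phi\,\mathbf{P}$ denotes the resulting distribution over latent states, $\overline{\mathbf{P}}$ is the latent transition function, and $\xi_{\pi_{v, e}}$ is the distribution of state-action pairs encountered while running $\pi_{v, e}$ in the room. Minimizing this quantity makes the latent model mimic the room’s dynamics as seen by the low-level policy.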
Finally, we showed that constructing a “compressed MDP” using the map’s structure (graph topology) and the values of the low-level specifications as transition probabilities is sufficient for optimal control with respect to the learned policies. Solving this MDP yields the high-level planner. Importantly, we proved that minimizing each individual room’s loss ensures that the compressed MDP’s planner values align with the global two-level objective! So, the guarantees obtained when learning the low-level policies can be lifted to the high-level planner.
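To give an intuition of this construction, here is a minimal sketch in Python, assuming the verified low-level values are available as a dictionary indexed by (room, direction). It ignores the planner’s memory and the rooms’ entry points, which the actual construction accounts for; all names are illustrative:

```python
def build_and_solve_compressed_mdp(edges, values, target, gamma=0.99, iterations=1000):
    """Sketch of the 'compressed MDP': states are rooms, actions are exit directions,
    and the verified low-level value of 'safely reach exit e of room v' plays the role
    of the probability of successfully moving to the neighboring room.
    Returns a high-level planner mapping each room to the direction to follow."""
    rooms = sorted({v for (v, _, _) in edges} | {w for (_, _, w) in edges})
    V = {v: 0.0 for v in rooms}
    for _ in range(iterations):                 # value iteration on the compressed MDP
        for v in rooms:
            if v == target:
                V[v] = 1.0
                continue
            V[v] = max((values[(u, e)] * gamma * V[w]
                        for (u, e, w) in edges if u == v), default=0.0)
    planner = {v: max(((values[(u, e)] * gamma * V[w], e)
                       for (u, e, w) in edges if u == v), default=(0.0, None))[1]
               for v in rooms if v != target}
    return planner, V

# Toy usage: a 2-room map where safely exiting "storage" eastwards has verified value 0.9.
edges = {("storage", "east", "loading_dock")}
values = {("storage", "east"): 0.9}
planner, V = build_and_solve_compressed_mdp(edges, values, target="loading_dock")
# planner == {"storage": "east"}
```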
Case studies
To evaluate our approach, we considered scenarios involving an agent navigating amid moving obstacles, in a large grid world and a custom Doom environment with visual inputs. Note that both state spaces are gigantic. In the grid world, each room is a 20x20 grid of cells and contains different obstacles/items. The grid world’s observation consists of a bitmap representing the current room (with one dimension per item/obstacle/adversary type). For ViZDoom, a game frame is provided with additional information, such as the player’s velocity, health, angle, and ammunition.
In both environments, adversaries have different (probabilistic) behaviors, and can move between rooms.
In the ViZDoom scenario, adversaries also spawn at random locations on the map.
Another video of a synthesized controller for a 9x9 grid world can be found here.
With WAE-MDPs' representation learning capabilities, we only needed to train one latent model and policy per direction, which allowed us to group similar rooms together! To further emphasize low-level policies' reusability, we changed the map’s layout, increased the number of rooms (from 9 to 49), and raised the number of adversaries (from 11 to 47 in the grid world, and 8 to 20 in ViZDoom). Results are shown below.
- our two-level controllers achieve high success probability, demonstrating that our approach alleviates reward engineering issues (e.g., dealing with sparse rewards)
- the values predicted in the latent models are consistent with, and even close to, those observed in the models where the abstraction loss is proven to be low (here, in ViZDoom; see Table 2).
Discussion
In summary, we showed that carefully composing RL policies enables the application of reactive synthesis techniques to obtain formally verified controllers. We are not aware of any other synthesis technique that can handle such large environments under the same assumptions. Furthermore, our technique also provides controllers in domains where standard RL struggles due to the need for extensive reward engineering (e.g., because of sparse rewards). Guarantees are critical in high-stakes scenarios, and we believe that drawing on ideas from, and incorporating, formal verification into RL is an important direction for improving the reliability of learning agents.