WAE-MDPs | Florent Delgrange

WBU

Source code for replicating the experiments presented in the paper The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models

Controller Synthesis from Deep Reinforcement Learning Policies

We propose a novel framework for controller design in environments with a two-level structure: a high-level graph in which each vertex is populated by a Markov decision process, called a “room”, with several low-level objectives. We proceed as follows. First, we apply deep reinforcement learning (DRL) to obtain low-level policies for each room and objective. Second, we apply reactive synthesis to obtain a planner that selects which low-level policy to apply in each room. Reactive synthesis refers to constructing a planner for a given model of the environment that satisfies a given objective (typically specified as a temporal logic formula) by design. The main advantage of the framework is formal guarantees. In addition, the framework enables a “separation of concerns”: low-level tasks are addressed using DRL, which enables scaling to large rooms of unknown dynamics, reward engineering is only done locally, and policies can be reused, whereas users can specify high-level tasks intuitively and naturally. The central challenge in synthesis is the need for a model of the rooms. We address this challenge by developing a DRL procedure to train concise “latent” policies together with latent abstract rooms, both paired with PAC guarantees on performance and abstraction quality. Unlike previous approaches, this circumvents a model distillation step. We demonstrate feasibility in a case study involving agent navigation in an environment with moving obstacles.

Florent Delgrange, Guy Avni, Anna Lukina, Christian Schilling, Ann Nowé, Guillermo A. Pérez

The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models

Partially Observable Markov Decision Processes (POMDPs) are useful tools to model environments where the full state cannot be perceived by an agent. As such, the agent needs to reason taking into account the past observations and actions. However, simply remembering the full history is generally intractable due to the exponential growth in the history space. Keeping a probability distribution that models the belief over what the true state is can be used as a sufficient statistic of the history, but its computation requires access to the model of the environment and is also intractable. Current state-of-the-art algorithms use Recurrent Neural Networks (RNNs) to compress the observation-action history aiming to learn a sufficient statistic, but they lack guarantees of success and can lead to suboptimal policies. To overcome this, we propose the Wasserstein-Belief-Updater (WBU), an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of our approximation, ensuring that our outputted beliefs allow for learning the optimal value function.

Raphael Avalos, Florent Delgrange, Ann Nowé, Guillermo A. Pérez, Diederik M. Roijers
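
For readers unfamiliar with the belief update mentioned in the abstract, the following is a minimal sketch of the exact Bayesian update for a small finite POMDP whose model is known. It is not the WBU algorithm, which learns an approximation of this update; the function name, the table layout (`T[a][s][s2]`, `O[a][s2][z]`), and the toy numbers are assumptions made for illustration only.

```python
def belief_update(belief, action, observation, T, O):
    """Exact Bayes belief update for a finite POMDP with a known model.

    belief: list of state probabilities, summing to 1
    T[a][s][s2]: probability of moving from s to s2 under action a (assumed layout)
    O[a][s2][z]: probability of observing z after reaching s2 with action a (assumed layout)
    """
    n = len(belief)
    # Prediction step: push the current belief through the transition model.
    predicted = [sum(belief[s] * T[action][s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight each successor state by the observation likelihood.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return [u / total for u in unnormalized]


# Toy model (hypothetical numbers): two states, one action, two observations.
T = [[[0.9, 0.1],
      [0.2, 0.8]]]
O = [[[0.8, 0.2],
      [0.3, 0.7]]]

new_belief = belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O)
print(new_belief)
```

Each update costs time quadratic in the number of states and needs the full model, which is why the abstract describes this computation as impractical for large or unknown environments.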


WAE-PCN: Wasserstein-autoencoded Pareto Conditioned Networks

In real-world problems, decision makers often have to balance multiple objectives, which can result in trade-offs. One approach to finding a compromise is to use a multi-objective approach, which builds a set of all optimal trade-offs called a Pareto front. Learning the Pareto front requires exploring many different parts of the state-space, which can be time-consuming and increase the chances of encountering undesired or dangerous parts of the state-space. In this preliminary work, we propose a method that combines two frameworks, Pareto Conditioned Networks (PCN) and Wasserstein auto-encoded MDPs (WAE-MDPs), to efficiently learn all possible trade-offs while providing formal guarantees on the learned policies. The proposed method learns the Pareto-optimal policies while providing safety and performance guarantees, especially towards unexpected events, in the multi-objective setting.

Florent Delgrange, Mathieu Reymond, Ann Nowé, Guillermo A. Pérez
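
To illustrate what a Pareto front is, here is a minimal sketch that filters a finite set of return vectors down to the non-dominated ones. It assumes every objective is to be maximized and is not part of PCN or WAE-PCN; the function names and sample points are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(points):
    """Keep the points that no other point dominates (maximization assumed)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective returns, e.g. (speed, safety).
returns = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
print(pareto_front(returns))
```

The pairwise comparison is quadratic in the number of points, which is adequate for small sets of candidate policies.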

WAE-MDPs

Source code for replicating the experiments presented in the paper Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees

Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees

Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.

Florent Delgrange, Ann Nowé, Guillermo A. Pérez
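
To give a flavor of why a discrete latent model makes verification simple, the sketch below computes the probability of reaching a failure state within a fixed number of steps in a small finite Markov chain, such as one obtained by fixing a policy in a discrete latent MDP. It is an illustration under stated assumptions, not the verification algorithm used in the paper; the function name, the transition matrix, and the choice of failure state are hypothetical.

```python
def failure_prob_within(P, fail, horizon):
    """Probability of hitting a failure state within `horizon` steps, per start state.

    P[s][t]: probability of moving from state s to state t (hypothetical chain)
    fail: set of failure state indices
    """
    n = len(P)
    # With zero steps left, only the failure states themselves count as failed.
    x = [1.0 if s in fail else 0.0 for s in range(n)]
    for _ in range(horizon):
        # One backward step: failure states stay at 1, others average over successors.
        x = [1.0 if s in fail else sum(P[s][t] * x[t] for t in range(n))
             for s in range(n)]
    return x


# Hypothetical three-state chain; state 2 is the absorbing failure state.
P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]]

print(failure_prob_within(P, fail={2}, horizon=2))
```

Because the state space is finite and small, this backward iteration is exact up to floating-point error, which is the kind of simple check a discrete latent model allows.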
