
Official implementation of the Wasserstein Auto-encoded MDP (WAE-MDP) framework. WAE-MDPs enable the distillation of (any kind of) reinforcement learning policies into simpler controllers, paired with a discrete, tractable model of the environment (a latent space model). Both come with bisimulation guarantees, which allow formally verifying the behaviors of the agent operating under the simplified policy. The source code provided allows for replicating the experiments of the paper Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees.
The source code is available on GitHub.
Deep RL generally lacks guarantees but is very effective at solving complex tasks in intricate environments. On the other hand, model-checking techniques allow verifying an agent's policies, but require access to a model of the environment. Since model checking usually relies on an exhaustive exploration of the input model, the latter must in general be finite (discrete spaces) and tractable.
WAE-MDPs distill RL policies learned via any RL technique (in particular, through deep neural networks) by learning a discrete, latent representation of the intractable input space. The distillation yields both a simpler policy and a world model that are amenable to model checking (see the conceptual sketch below).
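
For intuition, here is a purely conceptual sketch of what the distilled artefacts look like once training is done; the names (`encoder`, `latent_policy`, `latent_model`) are hypothetical placeholders and do not correspond to this repository's API.

```python
# Conceptual sketch only: `encoder`, `latent_policy`, and `latent_model` are
# hypothetical placeholders, not this repository's actual API.
import numpy as np

def act(state: np.ndarray, encoder, latent_policy) -> int:
    """Run the distilled controller: map the environment state to one of
    finitely many latent states, then query the simple latent policy."""
    z = encoder(state)          # discrete latent state, e.g., an int in {0, ..., K-1}
    return latent_policy(z)     # action prescribed by the distilled policy

def rollout_latent_model(z0: int, latent_policy, latent_model, horizon: int = 100):
    """Roll out the tractable latent-space world model; this finite model is
    the object that can be handed to a model checker."""
    z = z0
    for _ in range(horizon):
        a = latent_policy(z)
        z = latent_model.sample_next_state(z, a)  # finite-state transition
        yield z, a
```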
One of the main features of WAE-MDPs is that the quality of the abstraction can be formally checked through PAC bounds on bisimulation pseudometrics. In short, the world model is guaranteed to be bisimilarly close to the real environment: it closely replicates the agent-environment interaction behaviors through its latent space. In addition, WAE-MDPs come with representation guarantees: the learned representation ensures that input states grouped into the same latent state are bisimilarly close (the agent behaves the same way from those states).
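
For reference, the bisimulation pseudometric underlying these guarantees is standardly defined (up to the exact reward scaling and discount factor used in the paper) as the fixpoint of

$$ d(s, s') = \max_{a} \Big( \left| \mathcal{R}(s, a) - \mathcal{R}(s', a) \right| + \gamma \cdot \mathcal{W}_{d}\big(\mathcal{P}(\cdot \mid s, a),\, \mathcal{P}(\cdot \mid s', a)\big) \Big), $$

where $\mathcal{W}_{d}$ denotes the Wasserstein distance between the two transition distributions, taken with the pseudometric $d$ itself as ground metric; bisimilar states are exactly those at distance zero.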
Beyond those PAC-verifiable guarantees, bisimulation guarantees are also maintained during learning.
This is in contrast to VAE-MDPs, which optimize a surrogate, variational objective to avoid the intractable computation of the Wasserstein metric appearing in the fixpoint definition of bisimulation.
Instead, WAE-MDPs soundly approximate this computation by learning a discriminator (realizing the max in the dual formulation of the Wasserstein distance) that distinguishes between states likely to be produced by the learned world model and those of the real environment; a minimal sketch of this idea is given below.
This yields a tractable min-max training objective.
By doing so, WAE-MDPs avoid posterior collapse issues, which drastically speeds up learning, and yield distilled policies and world models of better quality.
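
The sketch below illustrates that discriminator idea in PyTorch, for illustration only (the repository's actual implementation, objective, and Lipschitz enforcement differ): a critic network approximates the max in the Kantorovich-Rubinstein dual of the Wasserstein-1 distance between real transitions and transitions produced by the learned latent model.

```python
# Minimal sketch (not this repository's implementation): a critic f approximates
# the max in the dual form of the Wasserstein-1 distance,
#   W_1(P, Q) = sup_{||f||_L <= 1}  E_{x~P}[f(x)] - E_{y~Q}[f(y)],
# here between states from the real environment and states produced by the
# learned world model. Batch names (`real_states`, `model_states`) are illustrative.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

def critic_loss(critic, real_states, model_states, gp_weight=10.0):
    """Negative dual objective plus a gradient penalty that softly enforces the
    1-Lipschitz constraint (WGAN-GP style; the paper's enforcement may differ)."""
    dual = critic(real_states).mean() - critic(model_states).mean()
    # Gradient penalty on random interpolates between the two batches.
    eps = torch.rand(real_states.size(0), 1, device=real_states.device)
    interp = (eps * real_states + (1 - eps) * model_states).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.norm(2, dim=1) - 1.0) ** 2).mean()
    return -dual + gp_weight * penalty  # the critic ascends `dual`, so minimize -dual
```

In the full WAE-MDP objective, this inner maximization is interleaved with minimizing the auto-encoder's reconstruction and transition losses, which gives the min-max objective mentioned above.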



