An Optimisation Framework for Unsupervised Environment Design

Nathan Monette *,†

UC Irvine

Alistair Letcher

University of Oxford

Michael Beukman

University of Oxford

Matthew T. Jackson

University of Oxford

Alexander Rutherford

University of Oxford

Alexander D. Goldie

University of Oxford

Jakob N. Foerster

University of Oxford

Reinforcement Learning Conference 2025

*Work undertaken while visiting the University of Oxford. Correspondence to nmonette at uci dot edu.

TL;DR

Unsupervised Environment Design (UED) is a reinforcement learning paradigm that aims to train an agent to perform well across variants (levels) of a configurable environment. UED is often framed as a game between the agent, which interacts with the environment, and an adversary, which proposes training levels. Our method, NCC, examines this game formally and alters the optimisation process so that it matches provably convergent methods from the optimisation literature. We then loosen our assumptions to make empirical gains over prior UED methods.

Motivation

Recent UED research has focused on optimising performance on popular benchmarks, but has forgone theoretical clarity as a result. While many of the ideas in foundational UED works like PAIRED are grounded in game-theoretic reasoning, in practice their use of neural networks for both the agent and the adversary creates an objective that is hard to analyse (theoretically similar to that of a GAN). Our goal in this paper was to create a UED method that finally has guaranteed convergence, and then to build out from there.

Why do we care about a theoretically convergent algorithm if the practical version doesn’t converge?

Robustness Results

In our method, the adversary's objective is the expected score of the policy over the levels, where the score generally measures the policy's room for improvement on each level. On the Minigrid benchmark, we have access to a regret oracle to use as the score function. In this setting, our practical method (NCC-Reg), which is not even provably convergent, strongly outperforms all of our baselines!
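
As a rough sketch of this objective (our notation, which may differ from the paper's exact formulation), the adversary maintains a distribution $\theta$ over the level set $\Lambda$ and maximises the expected score, with regret as one possible score function:

$$
\max_{\theta \in \Delta(\Lambda)} \; \mathbb{E}_{\ell \sim \theta}\big[\mathrm{score}(\pi, \ell)\big],
\qquad
\mathrm{score}_{\mathrm{regret}}(\pi, \ell) \;=\; V^{*}_{\ell} - V^{\pi}_{\ell},
$$

where $V^{*}_{\ell}$ is the optimal return on level $\ell$ and $V^{\pi}_{\ell}$ is the current policy's return.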

Our α-CVaR results
Our α-CVaR evaluation on the Minigrid benchmark.

Method

Our algorithmic design is reminiscent of PLR, which curates a distribution of levels to be replayed, but in order to attain guarantees, we optimise the adversary with the gradient of the policy’s entropy-regularised score on the levels! By using two-timescale gradient dynamics, we are then able to prove convergence for the agent.
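
To make this concrete, here is a minimal two-timescale sketch in numpy. It assumes a finite level set, user-supplied `score_fn` and `policy_grad_fn` callables (hypothetical names standing in for the entropy-regularised score and a REINFORCE-style gradient), and a projected-gradient update for the adversary; the step sizes and other details are illustrative, not the paper's.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def train(levels, policy_params, score_fn, policy_grad_fn,
          eta_agent=1e-2, eta_adversary=1e-3, steps=10_000):
    # Adversary's distribution over levels, initialised uniformly.
    theta = np.full(len(levels), 1.0 / len(levels))
    for _ in range(steps):
        # Agent step: policy gradient on a level sampled from the adversary.
        idx = np.random.choice(len(levels), p=theta)
        policy_params = policy_params + eta_agent * policy_grad_fn(policy_params, levels[idx])

        # Adversary step: ascend the gradient of the expected score w.r.t. theta
        # (the per-level scores), then project back onto the simplex.
        # The two step sizes are deliberately different (two-timescale);
        # their ratio here is illustrative only.
        scores = np.array([score_fn(policy_params, lvl) for lvl in levels])
        theta = project_to_simplex(theta + eta_adversary * scores)
    return policy_params, theta
```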

A diagram of our method's training loop.
Our method’s training loop, which involves gradient optimisation for both the agent and adversary.

We then extend our method to include a dynamic level buffer similar to PLR, and also use more contemporary model-free RL methods that are standard in UED (such as PPO). Even though these additions break the assumptions of our theory, they improve empirical performance.
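
For illustration, below is a minimal sketch of a PLR-style dynamic level buffer. The class, its replay probability, and its insertion rule are our simplifications (and assume non-negative scores); the buffer used in the actual method may differ in its scoring, staleness handling, and sampling.

```python
import numpy as np

class LevelBuffer:
    """Simplified PLR-style buffer: replay high-scoring levels, explore new ones."""

    def __init__(self, capacity, replay_prob=0.5):
        self.capacity = capacity
        self.replay_prob = replay_prob
        self.levels, self.scores = [], []

    def sample(self, new_level_fn):
        if self.levels and np.random.rand() < self.replay_prob:
            # Replay an existing level with probability proportional to its score
            # (assumes non-negative scores).
            probs = np.array(self.scores) / np.sum(self.scores)
            return self.levels[np.random.choice(len(self.levels), p=probs)]
        # Otherwise explore a freshly generated level.
        return new_level_fn()

    def update(self, level, score):
        # Keep the level if there is room, or if it beats the worst-scoring entry.
        if len(self.levels) < self.capacity:
            self.levels.append(level)
            self.scores.append(score)
        else:
            worst = int(np.argmin(self.scores))
            if score > self.scores[worst]:
                self.levels[worst], self.scores[worst] = level, score
```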

Analysis

We provide theoretical and empirical analysis throughout the paper, covering convergence as well as practicalities such as the choice of score function.

Theoretical vs. Practical

We provide a formal proof of our rate of convergence under mild assumptions on the neural architecture. Despite operating under stronger assumptions than our practical variants, the theoretical version of our method (NCC-T) remains competitive!

A comparison of performance of different types of our algorithm.
Minigrid performance with different versions of our method (NCC-T is the theoretical version and NCC-Learn/NCC-Reg are the practical variants with learnability and regret respectively).

A New Score Function - Generalised Learnability

SFL analysed different level score functions and determined that learnability, defined as the variance of a level's binary success outcome in a goal-based environment, is better for learning than regret approximations. We generalise this notion to the variance of returns in a deterministic environment; however, we scale it with a Gaussian to focus on levels of intermediate difficulty. We repeat SFL's analysis below on this new score function, generalised learnability (a simplified sketch of the score follows the figure).

A histogram that compares our score function and solve rate
A comparison between our new score function and solve rates on Minigrid. The blue points are data collected from a Minigrid training run and the black curve is a Gaussian fit to that data.
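
As a minimal sketch of this score (the Gaussian centre and width below are illustrative values, not the paper's), generalised learnability can be computed from a batch of episodic returns collected on a single level:

```python
import numpy as np

def generalised_learnability(returns, mu=0.5, sigma=0.2):
    """Variance of returns on a level, weighted by a Gaussian in the mean return
    so that levels of intermediate difficulty are emphasised. mu and sigma are
    illustrative, not the values used in the paper."""
    returns = np.asarray(returns, dtype=float)
    variance = returns.var()
    difficulty_weight = np.exp(-((returns.mean() - mu) ** 2) / (2.0 * sigma ** 2))
    return variance * difficulty_weight

# Example: returns from several rollouts on one level (assumed normalised to [0, 1]).
print(generalised_learnability([0.0, 1.0, 0.0, 1.0]))  # high variance, mid difficulty
```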

What’s Next?

We are excited to see further work building off of ours! Some research directions include using bilevel optimisation theory to analyse our setting under arbitrary score functions, analysing more practical variants of our method (for example, with better gradient estimators than REINFORCE), and handling the nonstationarity associated with a dynamic buffer. Please don't hesitate to reach out if you have any questions!

BibTeX citation

@inproceedings{
  monette2025an,
  title={An Optimisation Framework for Unsupervised Environment Design},
  author={Nathan Monette and Alistair Letcher and Michael Beukman and Matthew Thomas Jackson and Alexander Rutherford and Alexander David Goldie and Jakob Nicolaus Foerster},
  booktitle={Reinforcement Learning Conference},
  year={2025},
  url={https://openreview.net/forum?id=WnknYUybWX}
}