Unsupervised Environment Design (UED) is a reinforcement learning paradigm that aims to train an agent to perform well across the variants (levels) of a configurable environment. UED is often framed as a game between the agent, which interacts with the environment, and an adversary, which proposes training levels. Our method, NCC, formally examines this framing and alters the optimisation process so that it is compatible with provably convergent methods from the optimisation literature. We then relax our assumptions to make empirical gains over prior UED methods.
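One common way to write this game (as in regret-based UED methods such as PAIRED) is as a minimax problem over levels $\theta$ and policies $\pi$, with $V_\theta(\pi)$ the policy's expected return on level $\theta$; the notation here is ours, chosen for this summary:

$$
\min_{\pi}\; \max_{\theta \in \Theta}\; \operatorname{Regret}_\theta(\pi)
\;=\; \min_{\pi}\; \max_{\theta \in \Theta}\; \Big(\max_{\pi^\star} V_\theta(\pi^\star) - V_\theta(\pi)\Big).
$$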
Recent UED research has focused on optimising performance on popular benchmarks, but has forgone theoretical clarity as a result. While many of the ideas in foundational UED works like PAIRED are rooted in game-theoretic reasoning, in practice their use of neural networks for both the agent and the adversary creates an objective that is hard to analyse (much like the GAN objective). Our goal in writing this paper was to create a UED method that finally has guaranteed convergence, and then build out from there.
In our method, the adversary's objective is the expected score of the policy over the levels, where the score broadly measures the policy's room for improvement on each level. On the Minigrid benchmark, we have access to a regret oracle to use as the score function. In this setting, our practical (not even provably convergent) method, NCC-Reg, strongly outperforms all of our baselines!
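In symbols (again, notation introduced just for this summary): if $\Lambda$ is the adversary's distribution over levels and $S(\pi, \theta)$ the score of the policy on level $\theta$, the adversary maximises

$$
J(\Lambda) \;=\; \mathbb{E}_{\theta \sim \Lambda}\big[S(\pi, \theta)\big],
$$

with $S(\pi, \theta) = \operatorname{Regret}_\theta(\pi)$ when the regret oracle is available.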
Our algorithmic design is reminiscent of PLR, which curates a distribution of levels to be replayed, but in order to attain guarantees, we optimise the adversary with the gradient of the policy’s entropy-regularised score on the levels! By using two-timescale gradient dynamics, we are then able to prove convergence for the agent.
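For intuition, here is a minimal NumPy sketch of two-timescale dynamics over a fixed set of candidate levels: the adversary parameterises a softmax distribution over levels and ascends the gradient of the expected score, while the agent takes smaller steps on levels sampled from that distribution. The names `score_fn` and `policy_update`, the softmax parameterisation, and the choice of which player moves on the faster timescale are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def two_timescale_step(logits, policy, levels, score_fn, policy_update,
                       eta_adversary=1e-1, eta_agent=1e-3):
    """One illustrative update of the adversary (fast) and the agent (slow).

    logits        -- adversary parameters, one per candidate level
    score_fn      -- the policy's (entropy-regularised) score on a level, assumed given
    policy_update -- any policy-gradient step on a sampled level, assumed given
    """
    lam = softmax(logits)                                   # adversary's level distribution
    scores = np.array([score_fn(policy, lvl) for lvl in levels])

    # Exact gradient of E_{theta ~ lam}[score] w.r.t. the softmax logits:
    # d/d logit_j = lam_j * (score_j - E[score]).
    grad = lam * (scores - lam @ scores)
    logits = logits + eta_adversary * grad                  # fast timescale

    # The agent takes a smaller step on a level sampled from the adversary.
    lvl = levels[np.random.choice(len(levels), p=lam)]
    policy = policy_update(policy, lvl, eta_agent)          # slow timescale
    return logits, policy
```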
We then extend our method to include a dynamic level buffer similar to PLR, and also use more contemporary model-free RL methods standard in UED (like PPO). Even though these additions break the assumptions of our theory, they improve empirical performance.
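As a rough picture of what such a buffer looks like, here is a minimal PLR-style level buffer with rank-based replay sampling. This is a sketch of the general PLR idea, not our exact implementation; the capacity, temperature, and replacement rule are illustrative choices.

```python
import numpy as np

class LevelBuffer:
    """A minimal PLR-style level buffer (illustrative, not the paper's exact scheme)."""

    def __init__(self, capacity, temperature=1.0):
        self.capacity = capacity
        self.temperature = temperature
        self.levels, self.scores = [], []

    def add(self, level, score):
        # Insert a newly generated level; once full, replace the lowest-scoring
        # entry if the new level scores higher.
        if len(self.levels) < self.capacity:
            self.levels.append(level)
            self.scores.append(score)
        else:
            worst = int(np.argmin(self.scores))
            if score > self.scores[worst]:
                self.levels[worst], self.scores[worst] = level, score

    def sample(self):
        # Rank-based replay distribution, as in PLR: higher-scoring levels
        # are replayed more often.
        ranks = np.empty(len(self.scores))
        ranks[np.argsort(self.scores)[::-1]] = np.arange(1, len(self.scores) + 1)
        weights = (1.0 / ranks) ** (1.0 / self.temperature)
        probs = weights / weights.sum()
        idx = np.random.choice(len(self.levels), p=probs)
        return self.levels[idx], idx

    def update_score(self, idx, score):
        # Refresh a level's score after it has been replayed.
        self.scores[idx] = score
```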
We provide theoretical and empirical analysis throughout the paper, covering convergence as well as practicalities such as the choice of score function.
We provide a formal proof of our rate of convergence under mild assumptions on the neural architecture. Despite operating under stronger assumptions than our practical method, our theoretical variant (NCC-T) still remains competitive!
SFL analysed different level score functions and found that learnability, i.e. the variance of level success in a goal-based environment, is better for learning than regret approximations. We generalise this notion to the variance of returns in a deterministic environment, but scale it with a Gaussian to focus on levels of intermediate difficulty. We repeat SFL's analysis on this new score function, which we call generalised learnability.
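The sketch below shows the flavour of both scores: SFL's learnability for binary-success levels, and a generalised version based on return variance with a Gaussian applied to the mean return. The Gaussian's centre and width, and the assumption that returns are normalised to [0, 1], are illustrative choices for this summary, not values from the paper.

```python
import numpy as np

def learnability(successes):
    """SFL learnability for goal-based (binary-success) levels: p * (1 - p)."""
    p = np.mean(successes)
    return p * (1.0 - p)

def generalised_learnability(returns, centre=0.5, width=0.25):
    """Illustrative generalised learnability: the variance of a level's returns,
    scaled by a Gaussian in the mean return so that levels of intermediate
    difficulty are emphasised. `centre` and `width` are hypothetical values."""
    mean_ret = np.mean(returns)
    var_ret = np.var(returns)
    gate = np.exp(-0.5 * ((mean_ret - centre) / width) ** 2)
    return var_ret * gate
```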
We are excited to see further work building on ours! Some research directions include using bilevel optimisation theory to analyse our setting under arbitrary score functions, as well as analysing more practical variants of our method, for example with better gradient estimators than REINFORCE. Another direction would be to handle the nonstationarity associated with a dynamic buffer. Please don't hesitate to reach out if you have any questions!
@inproceedings{
monette2025an,
title={An Optimisation Framework for Unsupervised Environment Design},
author={Nathan Monette and Alistair Letcher and Michael Beukman and Matthew Thomas Jackson and Alexander Rutherford and Alexander David Goldie and Jakob Nicolaus Foerster},
booktitle={Reinforcement Learning Conference},
year={2025},
url={https://openreview.net/forum?id=WnknYUybWX}
}