World Models: Recurrent World Models Facilitate Policy Evolution
Source
- Raw Markdown: paper_world-models-2018.md
- PDF: paper_world-models-2018.pdf
- Preprint: https://arxiv.org/abs/1803.10122
- Official interactive article: https://worldmodels.github.io/
- NeurIPS paper: https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution
- Official code: https://github.com/hardmaru/WorldModelsExperiments
- Article source: https://github.com/worldmodels/worldmodels.github.io
Credibility
This is a 2018 NeurIPS paper by David Ha and Jürgen Schmidhuber. It is several years old and is not current SOTA, but it is a canonical historical anchor for modern visual model-based RL and action-conditioned world-model language. Use it as landmark background for the VAE + MDN-RNN + controller decomposition, learned latent rollout, and model exploitation caveats, then compare current claims against newer JEPA, Dreamer-style, latent-diffusion, and foundation-model world-model sources.
Core Claim
A compact controller can solve visual control tasks when it acts through a learned latent world model: a VAE compresses image observations into latent codes, an MDN-RNN predicts future latent codes conditioned on action and recurrent state, and a small controller uses the latent and recurrent state to choose actions.
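To make "small controller" concrete, here is a minimal sketch of a linear controller that maps the concatenated latent code and recurrent state to an action. The dimensions (32-d latent, 256-d recurrent state, 3 actions), the `tanh` squashing, and the names `controller_action` and `params` are assumptions chosen for illustration, not details confirmed by this note.

```python
import numpy as np

# Illustrative sizes, assumed for this sketch only.
Z_DIM, H_DIM, A_DIM = 32, 256, 3


def controller_action(params, z, h):
    """Linear controller: a_t = tanh(W [z_t ; h_t] + b).

    `params` is a flat vector so a black-box optimizer can treat the
    controller as a single point in parameter space.
    """
    n_in = Z_DIM + H_DIM
    W = params[: A_DIM * n_in].reshape(A_DIM, n_in)
    b = params[A_DIM * n_in:]
    return np.tanh(W @ np.concatenate([z, h]) + b)


rng = np.random.default_rng(0)
n_params = A_DIM * (Z_DIM + H_DIM) + A_DIM  # weights plus biases
params = rng.normal(scale=0.1, size=n_params)
action = controller_action(params, rng.normal(size=Z_DIM), np.zeros(H_DIM))
```

With these assumed sizes the controller has only 867 parameters, which is what makes gradient-free search over the whole parameter vector practical.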
Key Contributions
- Popularized the V/M/C decomposition: vision encoder, memory/dynamics model, and small controller.
- Trained the dynamics model from random-policy rollouts without reward supervision, then optimized the controller with CMA-ES.
- Demonstrated that the controller can use recurrent predictive state directly for reflex-like action selection.
- Showed an agent trained inside a learned latent “dream” environment can transfer back to the actual VizDoom environment.
- Made world-model exploitation concrete: policies can discover adversarial behavior that works in the learned simulator but fails in the real environment.
- Used MDN-RNN temperature as a practical uncertainty knob to reduce reward hacking against an imperfect model.
- Sketched an iterative training loop where model prediction loss can drive curiosity and data collection in unfamiliar parts of the environment.
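The controller-optimization step in the list above can be illustrated with a much simpler evolution strategy than CMA-ES. The sketch below is a plain Gaussian elite-averaging ES with a fixed step size, not CMA-ES itself (there is no covariance adaptation), and `simple_es`, `toy_fitness`, and `target` are hypothetical names. In the paper's setting the fitness would be the cumulative reward of a rollout; here it is a toy quadratic so the example stays self-contained.

```python
import numpy as np


def simple_es(fitness, n_params, pop_size=32, elite_frac=0.25,
              sigma=0.5, generations=60, seed=0):
    """Elite-averaging evolution strategy (a simplified stand-in for CMA-ES)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        # Sample candidate parameter vectors around the current mean.
        population = mean + sigma * rng.normal(size=(pop_size, n_params))
        scores = np.array([fitness(p) for p in population])
        # Keep the highest-scoring candidates and move the mean to their average.
        elite = population[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
    return mean


target = np.array([0.5, -1.0, 2.0])


def toy_fitness(params):
    # Higher is better; maximized when params equals target.
    return -np.sum((params - target) ** 2)


best = simple_es(toy_fitness, n_params=3)
```

Because the optimizer only needs a scalar score per candidate, the same loop works whether the rollout happens in the real environment or inside the learned model.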
Method Notes
The central dynamics interface is:
$$P(z_{t+1} \mid a_t, z_t, h_t)$$
For Doom, the model also predicts the termination event $d_{t+1}$:
$$P(z_{t+1}, d_{t+1} \mid a_t, z_t, h_t)$$
```mermaid
flowchart LR
    Obs["image observation"]
    VAE["V: VAE encoder"]
    Z["latent code z_t"]
    Action["action a_t"]
    RNN["M: MDN-RNN"]
    H["recurrent state h_t"]
    C["C: small controller"]
    Env["environment or learned dream"]
    Obs --> VAE --> Z
    Z --> C
    H --> C
    C --> Action --> Env --> Obs
    Z --> RNN
    Action --> RNN
    H --> RNN --> H
```
For this wiki, the important interface is not the exact 2018 architecture. It is the action-conditioned latent transition contract: `observation history + action -> future latent state`, plus a controller or planner that consumes that state before acting.
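The transition contract can be sketched as a rollout loop that never touches pixels. `ToyLatentModel` below is a hypothetical stand-in with random weights, not the paper's MDN-RNN, and `dream_rollout` and `policy` are illustrative names. The point is only the shape of the interface: the policy reads `(z, h)`, emits an action, and the model advances both the latent and the recurrent state.

```python
import numpy as np


class ToyLatentModel:
    """Hypothetical dynamics model M with fixed random weights (illustration only)."""

    def __init__(self, z_dim=4, h_dim=8, a_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.z_dim, self.h_dim, self.a_dim = z_dim, h_dim, a_dim
        self.W_h = rng.normal(scale=0.3, size=(h_dim, z_dim + a_dim + h_dim))
        self.W_z = rng.normal(scale=0.3, size=(z_dim, h_dim))

    def step(self, z, a, h):
        # Recurrent update conditioned on latent, action, and previous state.
        h_next = np.tanh(self.W_h @ np.concatenate([z, a, h]))
        # Deterministic next-latent prediction (a real MDN would sample here).
        z_next = self.W_z @ h_next
        return z_next, h_next


def dream_rollout(model, policy, z0, horizon):
    """Roll the policy forward entirely inside the latent model."""
    z, h = z0, np.zeros(model.h_dim)
    latents = [z]
    for _ in range(horizon):
        a = policy(z, h)            # controller consumes state before acting
        z, h = model.step(z, a, h)  # model predicts the next latent state
        latents.append(z)
    return np.stack(latents)


model = ToyLatentModel()
policy = lambda z, h: np.tanh(z[: model.a_dim])  # toy policy for the sketch
trajectory = dream_rollout(model, policy, z0=np.ones(model.z_dim), horizon=10)
```

Swapping `ToyLatentModel` for any other object with the same `step(z, a, h)` signature leaves the rollout loop unchanged, which is the sense in which the contract matters more than the 2018 architecture.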
Evidence And Results
- In `CarRacing-v0`, the full VAE + MDN-RNN world model with a linear controller reports `906 +/- 21` over 100 random trials, compared with `632 +/- 251` for the V-only linear controller and `788 +/- 141` for the V-only controller with a hidden layer.
- In VizDoom `TakeCover`, the paper trains the controller in the learned latent environment and reports transfer back to the actual environment with `1092 +/- 556` time steps survived at the selected temperature.
- The paper reports that low-temperature deterministic dreams can create policies that exploit model errors and fail in the actual environment, while moderate stochasticity improves transfer.
- The experiments use visual game trajectories, explicit actions, and rollout rewards, so they are strong evidence for a narrow visual-control setting rather than a general digital-system world model.
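The temperature effect reported above can be illustrated with a one-dimensional mixture density sample. The sketch assumes one common convention: mixture logits are divided by the temperature and each component's standard deviation is scaled by the square root of the temperature. Whether this matches the official implementation exactly is not verified here, and `sample_mdn`, `spread`, and the mixture values are made up for illustration.

```python
import numpy as np


def sample_mdn(log_pi, mu, log_sigma, temperature, rng):
    """Draw one scalar from a Gaussian mixture under temperature scaling."""
    # Lower temperature sharpens the mixture weights toward the dominant mode.
    logits = log_pi / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()
    k = rng.choice(len(pi), p=pi)
    # Lower temperature also shrinks the noise around the chosen component.
    return mu[k] + np.exp(log_sigma[k]) * np.sqrt(temperature) * rng.normal()


rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.7, 0.2, 0.1]))  # made-up mixture weights
mu = np.array([0.0, 2.0, -2.0])             # made-up component means
log_sigma = np.log(np.array([0.1, 0.1, 0.1]))


def spread(temperature, n=2000):
    """Empirical standard deviation of samples at a given temperature."""
    draws = [sample_mdn(log_pi, mu, log_sigma, temperature, rng) for _ in range(n)]
    return float(np.std(draws))


low_t, high_t = spread(0.1), spread(1.5)
```

At low temperature nearly every draw comes from the dominant component, so the dream is close to deterministic and easier for a controller to exploit. At higher temperature the rarer modes appear and the sample spread grows.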
Limitations
- The architecture is now historical: VAE + MDN-RNN + CMA-ES is not the current frontier for large-scale world models.
- Training is staged rather than end-to-end, and the VAE may preserve visually salient but task-irrelevant detail while dropping task-relevant detail.
- The controller can exploit the learned simulator, especially when the dynamics model is too deterministic or out of distribution.
- The evidence is from OpenAI Gym game environments, not multivariate operational time series, irregular event streams, robotics manipulation at scale, or real digital-system interventions.
- The work does not solve long-horizon hierarchical planning, large memory capacity, continual update, or cross-system transfer.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Control and counterfactuals | partially closes | Learns latent next-state dynamics conditioned on explicit actions and uses imagined rollouts to train or evaluate a controller. | Visual Gym environments only; no digital-system actions, operator interventions, confounding analysis, or real operational telemetry. |
| Multi-modal future distributions | partially closes | Uses an MDN-RNN to model a distribution over next latent states and exposes temperature as an uncertainty/exploitability control. | No calibrated decision-facing future distributions for numeric time-series systems. |
| Representation quality: semantic state vs dense detail | warning | Shows that compressed visual latents can support control, but also that VAE reconstruction can preserve irrelevant detail and miss task-relevant detail. | Needs task-conditioned or representation-space objectives that retain intervention-relevant state under scale. |
| Benchmarks: what level of modeling is tested | warning | The CarRacing and VizDoom tests include action-conditioned rollouts and transfer from learned simulator to real environment. | Benchmarks are small game environments and can be exploited by simulator-specific policies. |
Links Into The Wiki
- World Models
- Foundation Time-Series Model Research Agenda
- Latent-Space Predictive Learning
- Evolution Strategies
- Time-Series Benchmark Hygiene
- LeWorldModel
- Reconstruction Or Semantics?
- RAEv2
Open Questions
- Which modern world-model architectures preserve the useful action-conditioned interface while avoiding 2018-era simulator exploitation?
- How should uncertainty be represented so a controller can avoid brittle high-reward hallucinations without making the learned environment too noisy to plan in?
- Can the V/M/C decomposition be transferred to observability or digital-system time series as `encoder + action-conditioned latent dynamics + intervention-ranking controller`?
- What benchmark would test the same “learn inside the model, transfer outside it” claim for operator actions, rollbacks, autoscaling, or remediation choices?
- Can prediction-loss-driven curiosity become a safe curriculum signal without over-sampling corrupt, adversarial, or irrelevant high-surprise states?