---
abstract: |
  As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term *world model* carries different meanings across research communities. We introduce a "levels $\times$ laws" taxonomy organized along two axes. The first defines three capability levels: **L1 Predictor**, which learns one-step local transition operators; **L2 Simulator**, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and **L3 Evolver**, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes (physical, digital, social, and scientific) that determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level--regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
author:
- |
  Meng Chu$^{1\dagger}$, Xuan Billy Zhang$^{2\dagger\text{\faCube}}$, Kevin Qinghong Lin$^{3\dagger}$, Lingdong Kong$^{2\dagger}$,\
  Jize Zhang$^{3\dagger}$, Teng Tu$^{2\dagger}$, Weijian Ma$^{2\dagger}$, Ziqi Huang$^{4}$, Senqiao Yang$^{5}$, Wei Huang$^{6}$,\
  Yeying Jin$^{2}$, Zhefan Rao$^{1}$, Jinhui Ye$^{1}$, Xinyu Lin$^{2}$, Xichen Zhang$^{1}$, Qisheng Hu$^{4}$,\
  Shuai Yang$^{6}$, Leyang Shen$^{2}$, Wei Chow$^{2}$, Yifei Dong$^{7}$, Fengyi Wu$^{7}$, Quanyu Long$^{4}$,\
  Bin Xia$^{5}$, Shaozuo Yu$^{5}$, Mingkang Zhu$^{5}$, Wenhu Zhang$^{1}$, Jiehui Huang$^{1}$,\
  Haokun Gui$^{1}$, Haoxuan Che$^{1\S}$, Long Chen$^{1\S}$, Qifeng Chen$^{1\S}$, Wenxuan Zhang$^{9\S}$,\
  Wenya Wang$^{4\S}$, Xiaojuan Qi$^{6\S}$, Yang Deng$^{10\S}$, Yanwei Li$^{5\S}$, Mike Zheng Shou$^{2\S}$,\
  Zhi-Qi Cheng$^{7\S}$, See-Kiong Ng$^{2\S}$, Ziwei Liu$^{4\S}$, Philip Torr$^{3\S}$, Jiaya Jia$^{1\S}$\
  `\normalfont `{=latex}$^\dagger$Core Contributor.`\;`{=latex}`\;`{=latex}`\faCube`{=latex}`\;`{=latex}Project Lead.`\;`{=latex}`\;`{=latex}$^\S$Senior Author.\
  `\normalfont`{=latex}$^{1}$Hong Kong University of Science and Technology, $^{2}$National University of Singapore, $^{3}$University of Oxford, $^{4}$Nanyang Technological University, $^{5}$Chinese University of Hong Kong, $^{6}$University of Hong Kong, $^{7}$University of Washington, $^{8}$Hong Kong University of Science and Technology (Guangzhou), $^{9}$Singapore University of Technology and Design, $^{10}$Singapore Management University
bibliography:
- main.bib
title: 'Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\cmark}{{\color{green!60!black}\ding{52}}}
```
```{=latex}
\newcommand{\xmark}{{\color{red}\ding{55}}}
```
```{=latex}
\newcommand{\paperlink}[1]{\faBookOpen\,\href{#1}{\textcolor{magenta}{Paper}}}
```
```{=latex}
\newcommand{\githublink}[1]{\faGithub\,\href{#1}{\textcolor{magenta}{Code}}}
```
```{=latex}
\newcommand{\figplaceholderw}[1]{%
  \fbox{\large\textit{[#1]}}%
}
```
```{=latex}
\def\month{April}
```
```{=latex}
\def\year{2026}
```
```{=latex}
\def\openreview{\textit{OpenReview link to be added upon acceptance}}
```
```{=latex}
\maketitle
```
```{=latex}
\makeatletter
```
```{=latex}
\gdef\@github@title{\raisebox{-0.15em}{\includegraphics[height=1.1em]{assets/title_icon.png}}\;\,Agentic World Modeling: \\ Foundations, Capabilities, Laws, and Beyond}
```
```{=latex}
\makeatother
```
```{=latex}
\makeatletter
```
```{=latex}
\renewcommand{\aftertitskip}{0.8em}
```
```{=latex}
\makeatother
```
```{=latex}
\makeatletter
```
```{=latex}
\if@github
```
```{=latex}
\newpage
```
```{=latex}
\fi
```
```{=latex}
\makeatother
```
```{=latex}
\begin{figure*}[ht]


\resizebox{0.9\textwidth}{!}{\tikzset{
    my node/.style={
        draw,
        align=left,
        thin,
        text width=2.8cm,
        rounded corners=3,
    },
    my leaf/.style={
        draw,
        align=left,
        thin,
        text width=4cm,
        rounded corners=3,
    }
}

\forestset{
  every leaf node/.style={
    if n children=0{#1}{}
  },
  every tree node/.style={
    if n children=0{minimum width=1em}{#1}
  },
}
\begin{forest}
    for tree={%
        every leaf node={my leaf, font=\small\sffamily},
        every tree node={my node, font=\small\sffamily, l sep-=4.5pt, l-=1.pt},
        anchor=west,
        inner sep=2pt,
        l sep=10pt,
        s sep=4pt,
        fit=tight,
        grow'=east,
        edge={thick},
        parent anchor=east,
        child anchor=west,
        if n children=0{tier=last}{},
        edge path={
            \noexpand\path [draw, \forestoption{edge}] (!u.parent anchor) -- +(5pt,0) |- (.child anchor)\forestoption{edge label};
        },
        if={isodd(n_children())}{
            for children={
                if={equal(n,(n_children("!u")+1)/2)}{calign with current}{}
            }
        }{}
    }
    [{Agentic World Modeling}, draw=gray, color=gray!100, fill=gray!15, very thick, text=black, text width=2.2cm,
        [\S\ref{sec:introduction} Introduction, color=Cyan!100, fill=Cyan!15, very thick, text=black, text width=3.5cm
            [\S\ref{sec:motivation} Motivation, color=Cyan!100, fill=Cyan!15, very thick, text=black, text width=5cm]
            [\S\ref{sec:scope} Scope \& Organizing Principle, color=Cyan!100, fill=Cyan!15, very thick, text=black, text width=5cm]
            [\S\ref{sec:contributions} Contributions \& Positioning, color=Cyan!100, fill=Cyan!15, very thick, text=black, text width=5cm]
        ]
        [\S\ref{sec:preliminaries} Preliminaries, color=BlueGreen!100, fill=BlueGreen!15, very thick, text=black, text width=3.5cm
            [\S\ref{sec:philosophy} Epistemology to Capability Hierarchy, color=BlueGreen!100, fill=BlueGreen!15, very thick, text=black, text width=6cm]
            [\S\ref{sec:trends:representation} Representation in World Modeling, color=BlueGreen!100, fill=BlueGreen!15, very thick, text=black, text width=6cm]
            [\S\ref{subsec:notation_foundations} Notations, color=BlueGreen!100, fill=BlueGreen!15, very thick, text=black, text width=6cm]
            [\S\ref{subsec:formal_defs} Definitions of Capabilities, color=BlueGreen!100, fill=BlueGreen!15, very thick, text=black, text width=6cm]
            [\S\ref{subsec:scope_regimes} Scope of Laws, color=BlueGreen!100, fill=BlueGreen!15, very thick, text=black, text width=6cm]
        ]
        [\S\ref{sec:l1} L1 Predictor, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, text width=3.5cm
            [\S\ref{subsec:l1_definition} Definition, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, text width=4cm]
            [\S\ref{subsec:l1_methods} Approaches, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, text width=4cm
                [State Inference, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, tier=L1, text width=3.5cm]
                [Forward Dynamics, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, tier=L1, text width=3.5cm]
                [Observation Decoding, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, tier=L1, text width=3.5cm]
                [Inverse Dynamics, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, tier=L1, text width=3.5cm]
            ]
            [\S\ref{subsec:l1_theory_boundaries} Discussion, color=Periwinkle!100, fill=Periwinkle!15, very thick, text=black, text width=4cm]
        ]
        [\S\ref{sec:l2} L2 Simulator, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, text width=3.5cm
            [\S\ref{subsec:l2_requirements} Requirements for Elevation, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, text width=4.5cm]
            [\S\ref{subsec:l2_app} Applications, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, text width=3cm
                [\S\ref{subsec:l2_physical} Physical World, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, tier=L2, text width=4.5cm]
                [\S\ref{subsec:l2_software} Digital World, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, tier=L2, text width=4.5cm]
                [\S\ref{subsec:l2_social} Social World, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, tier=L2, text width=4.5cm]
                [\S\ref{subsec:l2_science} Scientific World, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, tier=L2, text width=4.5cm]
                [\S\ref{subsec:l2_crossdomain} Cross-Domain Analysis, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, tier=L2, text width=4.5cm]
            ]
            [\S\ref{subsec:l2_failure_modes} Failure Modes, color=darkpastelgreen!100, fill=darkpastelgreen!15, very thick, text=black, text width=4.5cm]
        ]
        [\S\ref{sec:l3} L3 Evolver, color=Orchid!100, fill=Orchid!15, very thick, text=black, text width=3.5cm
            [\S\ref{subsec:l3_definition} Formal Definition, color=Orchid!100, fill=Orchid!15, very thick, text=black, text width=5cm]
            [\S\ref{subsec:l2_vs_l3} Distinction from L2, color=Orchid!100, fill=Orchid!15, very thick, text=black, text width=5cm]
            [\S\ref{subsec:l3_examples} Examples \& Applications, color=Orchid!100, fill=Orchid!15, very thick, text=black, text width=5cm]
            [\S\ref{subsec:l3_context} L3 in Context, color=Orchid!100, fill=Orchid!15, very thick, text=black, text width=5cm]
        ]
        [\S\ref{sec:evaluation} Evaluations, color=Goldenrod!100, fill=Goldenrod!20, very thick, text=black, text width=3.5cm
            [\S\ref{subsec:eval_decision} Prediction vs Decision-Centric, color=Goldenrod!100, fill=Goldenrod!20, very thick, text=black, text width=5cm]
            [\S\ref{subsec:eval_boundary} Three Boundary Conditions, color=Goldenrod!100, fill=Goldenrod!20, very thick, text=black, text width=5cm]
            [\S\ref{subsec:eval_levels} L1/L2/L3 Differentiation, color=Goldenrod!100, fill=Goldenrod!20, very thick, text=black, text width=5cm]
            [\S\ref{subsec:eval_benchmarks} Benchmarks \& Coverage, color=Goldenrod!100, fill=Goldenrod!20, very thick, text=black, text width=5cm]
            [\S\ref{subsec:eval_open} Open Challenges, color=Goldenrod!100, fill=Goldenrod!20, very thick, text=black, text width=5cm]
        ]
        [\S\ref{sec:implementation} Practice Instantiation, color=Melon!100, fill=Melon!20, very thick, text=black, text width=3.5cm
            [\S\ref{subsec:impl_blocks} Architectural Building Blocks, color=Melon!100, fill=Melon!20, very thick, text=black, text width=5cm]
            [\S\ref{subsec:impl_tradeoffs} Design Tradeoffs by Regime, color=Melon!100, fill=Melon!20, very thick, text=black, text width=5cm]
            [\S\ref{subsec:impl_roadmap} Implementation Roadmap, color=Melon!100, fill=Melon!20, very thick, text=black, text width=5cm]
        ]
        [\S\ref{sec:trends} Trends \& Open Problems, color=CornflowerBlue!100, fill=CornflowerBlue!15, very thick, text=black, text width=5cm
            [\S\ref{sec:trends:history} Historical Development, color=CornflowerBlue!100, fill=CornflowerBlue!15, very thick, text=black, text width=5cm]
            % [\S\ref{sec:trends:representation}  Representation in World Modeling, color=CornflowerBlue!100, fill=CornflowerBlue!15, very thick, text=black, text width=5.5cm]
            [\S\ref{sec:trends:open} Open Problems by Level, color=CornflowerBlue!100, fill=CornflowerBlue!15, very thick, text=black, text width=5cm]
            [\S\ref{sec:trends:beyond} Beyond L3, color=CornflowerBlue!100, fill=CornflowerBlue!15, very thick, text=black, text width=5cm]
        ]
        [\S\ref{sec:conclusion} Conclusion, color=gray!100, fill=gray!15, very thick, text=black, text width=5cm
        ]
    ]
\end{forest}
}
\caption{\textbf{Organizational structure of this survey.} The paper is organized around three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and four governing-law regimes (physical, digital, social, scientific worlds), with supporting sections on evaluation, implementation, and open problems.}

\label{fig:toc_tree}

\end{figure*}
```
```{=latex}
\newpage
```
Introduction {#sec:introduction}
============

::: {.epigraph}
*One may say the eternal mystery of the world is its comprehensibility.*

@einstein1936physics
:::

The ambition to build internal models of reality has a long intellectual history, appearing in philosophical accounts of mental models [@craik1943nature; @johnson1983mental] and in modern machine learning as learned latent dynamics that support prediction, control, simulation, and scientific reasoning [@ha2018worldmodels; @hafner2019dreamer; @karniadakis2021pinn]. The phrase *world model* is now widely used across research communities, but its precise technical meaning varies considerably [@ding2024survey_wm; @zhu2024sora_survey]. In reinforcement learning, agents learn transition structure to imagine futures before acting [@sutton1991dyna; @ha2018worldmodels; @hafner2019dreamer; @schrittwieser2020muzero]. In computer vision, *world models* often denote video or 3D generators that maintain visual dynamics and temporal coherence [@brooks2024sora; @bruce2024genie; @nvidia2025cosmos; @worldlens; @liang2026lidarcrafter; @bian2025dynamiccity; @kong2025survey_3d4d]. In language modeling and agent systems, the term can refer to text-grounded simulation for planning, web interaction, and social environments [@wang2024worldsim; @gu2024webdreamer; @park2023generative; @zhang2026scafgrpo; @zhang2026searchgym]. In robotics, learned dynamics serve safe planning, data-efficient policy learning, and sim-to-real transfer [@wu2023daydreamer; @yang2024unisim; @min2024driveworld]. In AI for science, systems pair surrogate models with hypothesis-driven experimentation [@karniadakis2021pinn; @lu2024aiscientist].

From a complementary perspective, world models and agents are closely coupled. At its core, a world model learns the state-transition dynamics of an environment: given a current state and an action, it predicts the resulting next state. An agent, conversely, selects actions given a task objective and its current observations. These two components are mutually supportive. Agents rely on world models to anticipate the consequences of candidate actions, enabling look-ahead planning and sample-efficient learning [@hafner2023dreamerv3; @schrittwieser2020muzero; @dong2026lcvn; @dong2026uniwm]. Conversely, world models benefit from agent-generated experience, which provides targeted, task-relevant trajectories that improve the model's accuracy in decision-critical regions of the state space [@sutton1991dyna]. This close coupling motivates the capability-based perspective adopted in this survey: while world models serve many purposes, we operationally define their value by the quality of decisions they enable for downstream agents.
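
The coupling described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than a method drawn from the surveyed systems: the scalar state, the linear form $s' = s + g\,a$, the names `LinearWorldModel`, `select_action`, and `run_episode`, and the stand-in environment `true_step`. The agent uses the model for one-step look-ahead, and the transitions it generates are fed back to refit the model.

```python
from dataclasses import dataclass, field


@dataclass
class LinearWorldModel:
    """Toy one-step model: predicts s' = s + gain * a (hypothetical form)."""
    gain: float = 1.0
    data: list = field(default_factory=list)

    def predict(self, state: float, action: float) -> float:
        return state + self.gain * action

    def update(self, state: float, action: float, next_state: float) -> None:
        """Store one transition and refit `gain` by least squares."""
        self.data.append((action, next_state - state))
        num = sum(a * d for a, d in self.data)
        den = sum(a * a for a, _ in self.data)
        if den > 0:  # only refit once a non-zero action has been observed
            self.gain = num / den


def true_step(state: float, action: float) -> float:
    """Stand-in environment, unknown to the agent: s' = s + 0.5 * a."""
    return state + 0.5 * action


def select_action(model, state, goal, candidates):
    """One-step look-ahead: pick the action whose predicted outcome is nearest the goal."""
    return min(candidates, key=lambda a: abs(model.predict(state, a) - goal))


def run_episode(model, state=0.0, goal=3.0, steps=5):
    """Alternate between acting with the model and updating it from experience."""
    candidates = [-1.0, 0.0, 1.0]
    for _ in range(steps):
        action = select_action(model, state, goal, candidates)
        next_state = true_step(state, action)
        model.update(state, action, next_state)
        state = next_state
    return state
```

In this noise-free toy setting, a single non-zero action is enough for the least-squares refit to recover the environment's gain, which is the sense in which agent-generated experience improves the model along the trajectories the agent actually visits.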

Because world models constitute a foundational component whose value extends beyond any single agent architecture, their growing importance makes conceptual clarity all the more urgent. Yet the diversity outlined above also creates conceptual fragmentation: a vision researcher may evaluate a world model by the visual fidelity of its generated frames, while a reinforcement learning practitioner judges a model bearing the same name by whether it improves task performance. As a result, papers may report strong progress under one interpretation of *world model* while remaining incomparable under another. This paper addresses that fragmentation by providing a common language that can align communities without erasing domain-specific differences.

Motivation {#sec:motivation}
----------

1.  **Current survey landscape.** Several recent surveys have attempted to organize this rapidly growing literature. @ding2024survey_wm propose a dual taxonomy of *understanding* versus *predicting*, mapping world models onto application domains such as autonomous driving, robotics, and social simulacra. @zhu2024sora_survey focus on the generative capabilities catalyzed by Sora, surveying world models for video generation, autonomous driving, and autonomous agents. @yue2025simulating provide a roadmap for 2D visual world modeling with a four-generation capability taxonomy (G1--G4) applied to robotics, autonomous driving, and gaming. Their G1--G4 taxonomy is useful for distinguishing increasingly interactive visual generation systems; our L1--L3 hierarchy is complementary rather than competing, because it abstracts away from the visual modality and asks whether a system supports local prediction, decision-usable simulation, or evidence-driven revision across physical, digital, social, and scientific regimes. Roughly, early G-levels emphasize appearance and action-conditioned prediction, whereas our L2/L3 boundary is determined by constraint-valid rollout and persistent model update. Domain-specific surveys have also proliferated: @li2025embodied_wm_survey provide a three-axis framework (functionality, temporal modeling, spatial representation) specifically for embodied AI; @feng2025ad_wm_survey and @tu2025wm_ad_survey survey world models for autonomous driving; @kong2025survey_3d4d examine 3D and 4D world modeling; @zhang2025wm_manipulation_survey survey world models for robotic manipulation; and a growing number of position papers question what it means for a learned model to \`\`understand" physics [@lecun2022path; @kang2025howfar]. In AI for science, @wei2025agenticscience survey autonomous scientific discovery across life sciences, chemistry, materials, and physics, unifying process-oriented, autonomy-oriented, and mechanism-oriented perspectives. 
A parallel line of surveys addresses agent planning and reasoning: @wei2025plangenllms survey LLM planning capabilities across plan generation and verification, @huang2024planningsurvey taxonomize planning mechanisms into decomposition, selection, and reflection, @cao2025llmplanning provide a systematic comparison of fine-tuning versus search-based planning methods, @zhao2025agenticreasoning organize agentic reasoning into single-agent, tool-based, and multi-agent frameworks, and @arunkumar2026agenticai propose a unified agent taxonomy spanning perception, planning, action, and collaboration. These surveys complement ours: they focus on how agents *decide and act*, whereas we focus on the predictive substrate (the world model) that makes those decisions informed. Despite their valuable contributions, existing surveys share a common organizational principle that we argue is fundamentally limiting: they partition the field by **modality** or by **application domain**. Our work differs by organizing the field through a capability-based taxonomy that cuts across modalities, covering decision-making domains from embodied manipulation and autonomous driving to web agents, multi-agent coordination, and scientific discovery pipelines.

    ![**Positioning of this survey relative to existing world model and agent surveys.** Four clusters, Embodied World Models, Generative World Models, Language Agents, and AI for Science, each cover subsets of the field. Our survey (center) integrates cross-domain coverage with a capability-based taxonomy (L1/L2/L3 × four regimes), bridging largely isolated communities.](assets/venn.png){#fig:survey_positioning width="\\textwidth"}

    #### Gaps in existing surveys.

    The modality-centric and domain-centric taxonomies leave two critical gaps. First, they fail to capture the *capability progression* that cuts across modalities. A key example is model-based reinforcement learning, where latent-space \`\`imagination" rollouts can match or exceed model-free baselines across diverse domains such as Atari, continuous control, and Minecraft [@hafner2023dreamerv3; @schrittwieser2020muzero; @hafner2019dreamer]. We formalize this progression as a three-level capability hierarchy: one-step prediction, long-horizon simulation, and evidence-driven model revision. A related motivation for our framework is the intensifying debate over whether large-scale generative models are merely plausible generators or genuine world simulators. Existing surveys have surfaced this tension [@brooks2024sora; @bruce2024genie; @kang2025howfar; @ding2024survey_wm], but a capability-based taxonomy states the question more precisely. We identify four progressively stronger capabilities, namely rollout, intervention sensitivity, constraint consistency, and closed-loop use, that distinguish world models from generic predictors (formalized in Section `\ref{sec:preliminaries}`{=latex}). Second, existing surveys underrepresent the role of world modeling in agentic AI applications, including web agents, tool-use agents, and multi-agent systems, where learned environment dynamics are essential for planning and action selection [@gu2024webdreamer; @wang2024worldsim; @park2023generative]. The goal of this paper is to establish a capability-based taxonomy with clear and testable boundary conditions, and to use it to connect research communities that currently evaluate world modeling systems with different assumptions, objectives, and metrics.

Figure `\ref{fig:survey_positioning}`{=latex} positions this survey relative to existing work along two axes: scope (domain-specific to cross-domain) and organizing principle (modality-centric to capability-centric). Figure `\ref{fig:toc_tree}`{=latex} shows the organizational structure of the paper at a glance, grouping sections by the three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and the four governing-law regimes (physical, digital, social, and scientific worlds).

Scope and Organizing Principle {#sec:scope}
------------------------------

#### Governing principles across domains.

We organize the paper along two orthogonal axes: (i) **capability level** (L1/L2/L3, defined formally in Section `\ref{sec:preliminaries}`{=latex}), and (ii) **governing-law regime**, the constraints that legitimate transitions must satisfy in a domain. These levels are stages of world-modeling capability rather than mutually exclusive model classes: the same system may invoke different levels at different moments depending on task demand. Figure `\ref{fig:four_worlds}`{=latex} provides a schematic overview of the four regimes, which are listed below.

-   **Laws of the Physical World**: perception; physical interaction; robotic manipulation, navigation, autonomous driving, egocentric video prediction, action-conditioned video modeling, 3D world modeling.

-   **Laws of the Digital World**: program semantics; web navigation, software tool use, GUI environments.

-   **Laws of the Social World**: beliefs; goals; norms; social coordination, dialogue, multi-agent settings.

-   **Laws of the Scientific World**: latent mechanisms; experimental observables; causal structure; scientific discovery pipelines, measurement-coupled prediction, hypothesis-driven experimentation.

![**Schematic illustrations of the four governing-law regimes.** Representative scenes for each regime: a humanoid agent manipulating blocks (Physical World), code and UI surfaces (Digital World), a network of interacting agents with speech acts (Social World), and instrumented experimentation with robotic microscope and pipette (Scientific World). Each regime's formal constraints are discussed in Section `\ref{subsec:scope_regimes}`{=latex}.](assets/four_worlds_4k.png){#fig:four_worlds width="\\textwidth"}

In particular, the physical and scientific regimes are separated by how constraints are accessed: physical-world systems often admit analytic or simulator-based verification of transitions, whereas scientific-world systems typically require empirical validation because the governing mechanisms are only partially known.

Regimes are not \`\`orthogonal modalities": real systems mix them. The value of the taxonomy is diagnostic; it clarifies *which* invariants a method tries to preserve and *which queries* it can answer reliably.

More generally, a world model can predict transitions along any organizing dimension, such as spatial scales, frequency bands, or causal depth, provided it maintains the capability criteria along that axis. Throughout, we use *world model* to denote learned (or hybrid) operators that support intervention-aware transition queries, and *world modeling* to denote the staged process of strengthening those operators.

#### How an agent uses the three levels at runtime.

The L1/L2/L3 taxonomy is not a static classification of systems but a description of the capability an agent invokes at any given moment. A single deployed system can operate at different levels depending on the task demand:

1.  **L1 (Predictor).** The agent executes fast, reactive one-step predictions (such as perception, low-level motor control, or token-by-token generation) without maintaining a multi-step plan.

2.  **L2 (Simulator).** The agent upgrades to this level when the task requires comparing candidate action sequences, reasoning counterfactually about alternative futures, or verifying that a planned trajectory respects governing-law constraints; here the agent rolls out a multi-step simulation before committing.

3.  **L3 (Evolver).** The agent escalates to this level when its current model produces systematic prediction failures that cannot be resolved by re-planning within the existing model structure, that is, when the model itself must be revised, assets distilled, and updates validated before the next deployment.

This runtime dispatch view clarifies why L3 is not a replacement for L1/L2 but a governance layer that improves the stack when evidence demands it. Within a full agentic stack, world models are only one component: tool use determines how the agent acts on the environment, memory determines what evidence persists across episodes, multi-agent coordination shapes the effective transition dynamics in social settings, and reflection determines when failures trigger revision rather than mere re-planning. Our focus is the world-model substrate, but its role is always in service of these broader agentic loops.
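
The runtime dispatch described above can be sketched in a few lines of Python. This is an illustrative assumption rather than a prescribed mechanism: the function name `dispatch_level`, the use of a sliding window of recent one-step prediction errors as a proxy for "systematic failure", and the default values of `error_threshold` and `window` are all hypothetical.

```python
from enum import Enum
from typing import Sequence


class Level(Enum):
    L1_PREDICTOR = 1
    L2_SIMULATOR = 2
    L3_EVOLVER = 3


def dispatch_level(
    needs_multistep_plan: bool,
    recent_errors: Sequence[float],
    error_threshold: float = 0.5,  # hypothetical tolerance on prediction error
    window: int = 3,               # hypothetical number of consecutive failures
) -> Level:
    """Choose which capability level to invoke for the current step.

    Escalate to L3 when the last `window` prediction errors all exceed the
    threshold (a crude proxy for failures that re-planning cannot fix),
    use L2 when the task needs multi-step comparison of action sequences,
    and otherwise stay at fast one-step L1 prediction.
    """
    tail = list(recent_errors)[-window:]
    systematic_failure = len(tail) == window and all(
        e > error_threshold for e in tail
    )
    if systematic_failure:
        return Level.L3_EVOLVER
    if needs_multistep_plan:
        return Level.L2_SIMULATOR
    return Level.L1_PREDICTOR
```

A single isolated error does not trigger escalation in this sketch, which mirrors the distinction drawn above between failures that call for re-planning and failures that call for model revision.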

```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/fig3_roadmap.pdf}
\caption{\textbf{Timeline of representative world-modeling systems (2018--2026) organized by capability level.} The roadmap shows 70 survey anchors, capped at five systems per year--level cell for readability. L1~Predictor denotes one-step dynamics, L2~Simulator denotes decision-usable multi-step rollout, and L3~Evolver denotes full evidence-driven model revision; partial L3 loops remain in Table~\ref{tab:l3_systems}. Each pill is colored by governing-law regime: \textcolor[HTML]{3B78CF}{\textbf{Physical}} (blue), \textcolor[HTML]{399253}{\textbf{Digital}} (green), \textcolor[HTML]{D48436}{\textbf{Social}} (orange), and \textcolor[HTML]{67236C}{\textbf{Scientific}} (purple).}
\label{fig:paper_roadmap}
\end{figure*}
```
Contributions and Positioning {#sec:contributions}
-----------------------------

```{=latex}
\begin{keypoint}[Key Contributions]This paper makes three principal contributions (Figure~\ref{fig:paper_roadmap}):
\begin{enumerate}[leftmargin=*]
 \item \textbf{Capability-based roadmap for world modeling in agentic AI (L1$\to$L2$\to$L3).} We propose a three-level capability hierarchy with testable boundary conditions: L1 \textbf{Predict World} (one-step prediction), L2 \textbf{Simulate World} (long-horizon, action-conditioned rollout with constraint satisfaction), and L3 \textbf{Modify World} (evidence-driven model growth through autonomous data collection and dynamics revision). These are stages of capability, not types of models.
 \item \textbf{Cross-domain synthesis via governing laws.} We unify computer vision, language modeling, model-based RL and robotics, and AI for science into a single capability coordinate system. Different governing laws (Section~\ref{sec:preliminaries}) define the types or partitions of world models, \textbf{partially independent} of the L1$\to$L2$\to$L3 capability axis. This two-dimensional organization (capability level $\times$ law regime) reveals shared principles across communities that have developed in isolation, while clarifying domain-specific challenges that make direct transfer non-trivial.
 \item \textbf{L3 as a distinct capability level.} Evidence-driven model growth, where a system autonomously collects new evidence and revises its own dynamics model, has appeared in scattered forms across scientific discovery~\citep{lu2024aiscientist}, autonomous experimentation, and online adaptation. We argue this capability is qualitatively different from L2 rollout, formalize it as a distinct level, and identify the open problems that must be resolved to realize this capability at scale.
\end{enumerate}
\end{keypoint}
```
#### Positioning.

We present this paper as a *position-driven survey proposing a capability taxonomy for world modeling*. It advances a specific conceptual framework, namely the L1/L2/L3 capability hierarchy paired with a governing-law regime taxonomy, and argues for its adoption across the world modeling community. Unlike a pure survey, it proposes testable boundary conditions and uses them to re-examine how existing systems are classified. Unlike a pure position paper, it substantiates each argument with a comprehensive literature review spanning computer vision, reinforcement learning, robotics, natural language processing, and AI for science. This paper does not introduce a new benchmark or leaderboard; instead, it offers a unifying conceptual framework for interpreting and comparing existing systems and evaluations.

#### Outline.

Section `\ref{sec:preliminaries}`{=latex} establishes the conceptual and notational foundations: it motivates the three capability stages from epistemological intuition, gives each a formal definition with testable boundary conditions, and clarifies the distinctions between world modeling and generic prediction, world models and planners, and world modeling and commonsense. Sections `\ref{sec:l1}`{=latex}--`\ref{sec:l3}`{=latex} present the three capability levels in detail with representative methods and cross-domain analysis. Section `\ref{sec:evaluation}`{=latex} discusses evaluation methodology, Section `\ref{sec:implementation}`{=latex} addresses architectural and computational considerations, and Section `\ref{sec:trends}`{=latex} identifies emerging trends and open problems. Section `\ref{sec:conclusion}`{=latex} concludes. We note that L3 is not a terminal stage; Section `\ref{sec:trends}`{=latex} introduces *meta-world modeling*, in which the governing laws themselves become learnable, and identifies the open problems this entails.

Preliminaries {#sec:preliminaries}
=============

This section establishes the conceptual and notational foundations used throughout the paper. **(1) From epistemology to a capability hierarchy** draws on philosophical traditions to propose a three-level decomposition of world modeling capability (L1 Predictor, L2 Simulator, L3 Evolver) and to motivate why the boundaries fall where they do. **(2) Notation and formal definitions** fixes a unified symbol system and uses it to give each stage (L1, L2, L3) a precise definition with testable boundary conditions. **(3) Conceptual boundaries** clarifies the distinctions between world modeling and generic prediction and between world models and planners, and relates world modeling to the broader notion of commonsense reasoning that underwrites reliable everyday action beyond narrow predictive tasks.

Philosophical Motivations {#sec:philosophy}
-------------------------

A natural question for any world-modeling survey is: *what stages of understanding does a system pass through as it moves from pattern matching to genuine modeling?* Epistemology, the study of what counts as knowledge and how knowledge grows, offers a useful lens. Different philosophical traditions identify qualitatively different kinds of epistemic achievement; we draw on these traditions to propose a three-level capability hierarchy for world models. These philosophical analogies are heuristic rather than historical or one-to-one. We do not claim that ML systems implement philosophical programmes, but that philosophical distinctions help us see why certain capability boundaries recur across domains and what design questions each stage foregrounds (Figure `\ref{fig:philosophy_staircase}`{=latex}). Due to space constraints, detailed philosophical motivations and contemporary examples, together with extended historical context, are deferred to Appendix `\ref{app:philosophy_extended}`{=latex}.

![**From local prediction to evidence-driven revision: a hierarchical view of world modeling.** Level 1 models empirical regularities for prediction, Level 2 supports possible-world semantics and counterfactual simulation, and Level 3 introduces evidence-driven revision through continual interaction with the environment. This hierarchy frames world modeling as an ascending process from pattern recognition, to temporal rollout, to adaptive model evolution in real-world practice.](figures/fig4_0424.png){#fig:philosophy_staircase width="\\textwidth"}

#### L1 Predictor: from pattern to one-step forecast. {#sec:philosophy_l1}

The simplest epistemic achievement is learning patterns from data: given past observations, predict the next one. In philosophy, this is the terrain of Hume's *constant conjunction* [@hume1739treatise]; an agent records statistical co-occurrences without certifying why they hold. When a model learns one-step latent transitions from trajectories, it occupies exactly this epistemic position: it extracts succession from data and bets that the pattern persists. This view aligns with the predictive coding framework in cognitive science [@rao1999predictive; @friston2010free] and the \`\`Bayesian brain" hypothesis that perception is probabilistic inference [@clark2015surfing], motivating one-step latent forecasting as a computational primitive [@lake2017building].

We call this stage **L1 (Predictor)**. This Humean stance has inherent fragility. The i.i.d. assumption underlying most ML is effectively Hume's *Uniformity Principle* (the premise that the future will resemble the past), so when the distribution shifts, L1 models that rely on learned regularities fail to generalize. Nevertheless, the assumption that observed regularities persist is the most basic inductive bias, and the higher capability levels build on it.

#### L2 Simulator: rollout and counterfactual. {#sec:philosophy_l2}

Pattern matching alone does not answer *what would happen if we acted differently*. The next stage adds intervention and counterfactual reasoning: the ability to roll out coherent futures under chosen actions or hypothetical initial conditions and use the results for decision-making. David Lewis's theory of *closest possible worlds* [@lewis1973counterfactuals] captures this jump: effective counterfactual reasoning explores worlds maximally similar to our own, where only a minimal intervention distinguishes actual from counterfactual outcomes, providing a principled basis for reasoning about what would have happened under alternative actions taken by the agent at decision points.

We call this stage **L2 (Simulator)**. Because L2 rollouts are model-relative, their reliability depends on the learned model's own transition structure rather than on direct access to ground-truth dynamics. They therefore risk epistemic drift: a rollout can remain internally coherent while departing from the true dynamics, particularly once it leaves the training manifold. Plato's Allegory of the Cave [@plato1992republic] offers a vivid metaphor: a simulator excelling at predicting shadows on a wall may remain fundamentally bounded by the wall's dimensions, unable to access the fire casting those shadows.
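
The model-relative nature of L2 rollouts can be made concrete with a toy Python sketch. The scalar linear dynamics, the coefficients 0.95 and 0.97, and the function names `true_step`, `model_step`, `rollout`, and `rollout_errors` are assumptions chosen purely for illustration. The learned model differs from the environment by a small one-step error, yet the gap widens as the rollout feeds its own predictions back in.

```python
def true_step(x: float, u: float) -> float:
    """Stand-in ground-truth dynamics (not available to the simulator)."""
    return 0.95 * x + u


def model_step(x: float, u: float) -> float:
    """Stand-in learned dynamics with a slightly wrong coefficient."""
    return 0.97 * x + u


def rollout(step, x0: float, actions):
    """Action-conditioned multi-step rollout: each output is the next input."""
    xs = []
    x = x0
    for u in actions:
        x = step(x, u)
        xs.append(x)
    return xs


def rollout_errors(x0: float, actions):
    """Absolute gap between model and true trajectories at each horizon step."""
    true_traj = rollout(true_step, x0, actions)
    model_traj = rollout(model_step, x0, actions)
    return [abs(m - t) for m, t in zip(model_traj, true_traj)]
```

For these particular coefficients, a constant positive action, and a positive start state, the per-step gap increases with the horizon, so a model that looks accurate under one-step evaluation can still mislead a planner that relies on long rollouts.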

#### L3 Evolver: model revision from evidence. {#sec:philosophy_l3}

Even a powerful simulator eventually encounters situations where its predictions systematically fail, not because of parameter error but because the model class itself is too narrow. Epistemology offers a rich vocabulary for this transition. Lakatos's distinction between a *hard core* (architecture, inductive biases) and a *protective belt* (learned parameters) [@lakatos1978methodology] provides a useful parallel. Gradient steps mostly adjust the belt, while persistent structured errors may require changes to the core, such as new modules, parsers, constraints, or simulator hooks.

We call this stage **L3 (Evolver)**: the capacity to rebuild the laboratory when evidence demands it. This stage extends simulation into a full design--execute--observe--reflect loop: the system not only simulates but actively designs experiments, executes them, observes outcomes, and reflects to revise its model stack. Duhem--Quine holism [@duhem1954aim; @quine1951two] explains why blame-assignment is non-trivial: errors redistribute across modules until diagnostics isolate the brittle component. Proposed revisions should yield measurable improvements on held-out probes, regression suites, or experimental outcomes, rather than post-hoc adjustments that preserve the existing model despite contrary evidence from the environment.
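
The requirement that a revision produce measurable improvement can be sketched as a simple acceptance gate in Python. This is a hedged illustration only: the function names `mean_abs_error` and `accept_revision`, the mean-absolute-error metric, the `min_gain` and `tol` parameters, and the toy models and probe sets below are all hypothetical and not taken from any surveyed system.

```python
def mean_abs_error(model, probes):
    """Average absolute prediction error of `model` over (input, target) pairs."""
    return sum(abs(model(x) - y) for x, y in probes) / len(probes)


def accept_revision(current, candidate, held_out, regression,
                    min_gain=0.0, tol=1e-9):
    """Accept a candidate model only if it helps on new evidence
    and does not get worse on previously validated behaviour."""
    gain = mean_abs_error(current, held_out) - mean_abs_error(candidate, held_out)
    improves = gain > min_gain
    no_regression = (
        mean_abs_error(candidate, regression)
        <= mean_abs_error(current, regression) + tol
    )
    return improves and no_regression


# Toy models (hypothetical): the environment has an offset for x > 5.
def current_model(x):
    return 2.0 * x


def patched_model(x):
    """Adds the offset only in the regime where failures were observed."""
    return 2.0 * x + 1.0 if x > 5 else 2.0 * x


def shifted_model(x):
    """Fits the new evidence by shifting everything, breaking old cases."""
    return 2.0 * x + 1.0


HELD_OUT = [(6.0, 13.0), (7.0, 15.0)]   # probes from the failing regime
REGRESSION = [(1.0, 2.0), (2.0, 4.0)]   # previously validated cases
```

Under these toy definitions, `patched_model` passes the gate while `shifted_model` is rejected, because it explains the new evidence only by degrading behaviour that was already correct.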

Representation in World Modeling: Lessons from Scientific Theories {#sec:trends:representation}
------------------------------------------------------------------

The capability hierarchy in Section `\ref{sec:philosophy}`{=latex} addresses *what a world model can do*, but leaves open a prior question: *in what form should the world model actually be represented?* This question should not be treated solely as an implementation detail, because it determines whether the capabilities defined above, especially L3 revision, are realizable in practice across the diverse application domains covered in later sections.

```{=latex}
\resizebox{16.5cm}{!}{%
\begin{tikzpicture}[
    x=1.1cm, y=1.4cm,
    ms/.style={circle, draw=black, thick, inner sep=2pt},
    msBlue/.style={ms, fill=black!6!blue!12},
    msOrange/.style={ms, fill=black!4!orange!12},
    msGreen/.style={ms, fill=black!4!green!10},
    msPurple/.style={ms, fill=black!4!purple!12},
    labA/.style={anchor=south, font=\footnotesize, align=center, inner sep=3pt},
    labB/.style={anchor=north, font=\footnotesize, align=center, inner sep=3pt},
    era/.style={font=\normalsize\bfseries, anchor=south},
]

% === Era bands (soft muted colors matching NanoBanana #1) ===
\fill[black!6!blue!12]   (0,-0.6)    rectangle (3.6,2.2);
\fill[black!4!orange!12] (3.6,-0.6)  rectangle (9.5,2.2);
\fill[black!4!green!10]  (9.5,-0.6)  rectangle (16.0,2.2);
\fill[black!4!purple!12] (16.0,-0.6) rectangle (20.0,2.2);

% === Era labels ===
\node[era] at (1.8, 2.3) {{\color{blue!60}\faCompass}\ Mathematical Principles};
\node[era] at (6.55, 2.3) {{\color{orange!70}\faPuzzlePiece}\ Symbolic Intelligence};
\node[era] at (12.75, 2.3) {{\color{green!60!black}\faBrain}\ Connectionist Resurgence};
\node[era] at (18.0, 2.3) {{\color{purple!60}\faRocket}\ Generative Revolution};

% === Timeline arrow ===
\draw[->, very thick] (-0.3, 0.7) -- (20.3, 0.7);

% === AI winters as wide vertical bands (chronologically correct) ===
% AI Winter I (1974--80): after Lighthill 1973 (7.2), before Backprop 1986 (9.5)
\fill[cyan!18] (8.1, -0.6) rectangle (8.8, 2.2);
\fill[pattern=north east lines, pattern color=blue!30]
    (8.1, -0.6) rectangle (8.8, 2.2);
\node[font=\small\bfseries, blue!70!black, rotate=90] at (8.45, 0.7)
    {\faSnowflake\ AI Winter I};

% AI Winter II (1987--93): after Backprop 1986, before LeNet 1998
\fill[cyan!18] (10.3, -0.6) rectangle (11.0, 2.2);
\fill[pattern=north east lines, pattern color=blue!30]
    (10.3, -0.6) rectangle (11.0, 2.2);
\node[font=\small\bfseries, blue!70!black, rotate=90] at (10.65, 0.7)
    {\faSnowflake\ AI Winter II};

% === Pre-AI milestones (3): 1687, 1814, 1950 ===
\node[msBlue] at (0.5, 0.7) {};
\node[labA] at (0.5, 0.9) {Newton\\[-2pt]1687};

\node[msBlue] at (1.5, 0.7) {};
\node[labB] at (1.5, 0.5) {Laplace\\[-2pt]1814};

\node[msBlue] at (2.6, 0.7) {};
\node[labA] at (2.6, 0.9) {Turing\\[-2pt]1950};

% === Symbolic milestones (4): 1956, 1969, 1971, 1973 ===
\node[msOrange] at (3.6, 0.7) {};
\node[labB] at (3.6, 0.5) {Dartmouth\\[-2pt]1956};

\node[msOrange] at (5.0, 0.7) {};
\node[labA] at (5.0, 0.9) {Frame Problem\\[-2pt]1969};

\node[msOrange] at (6.2, 0.7) {};
\node[labB] at (6.2, 0.5) {STRIPS\\[-2pt]1971};

\node[msOrange] at (7.2, 0.7) {};
\node[labA] at (7.2, 0.9) {Lighthill\\[-2pt]1973};

% === Connectionist milestones (5): 1986, 1998, 2012, 2017, 2018 ===
\node[msGreen] at (9.5, 0.7) {};
\node[labB] at (9.5, 0.5) {Backprop\\[-2pt]1986};

\node[msGreen] at (11.5, 0.7) {};
\node[labA] at (11.5, 0.9) {LeNet\\[-2pt]1998};

\node[msGreen] at (12.7, 0.7) {};
\node[labB] at (12.7, 0.5) {AlexNet\\[-2pt]2012};

\node[msGreen] at (13.9, 0.7) {};
\node[labA] at (13.9, 0.9) {Transformer\\[-2pt]2017};

\node[msGreen] at (15.1, 0.7) {};
\node[labB] at (15.1, 0.5) {World Models\\[-2pt]2018};

% === Foundation milestones (6): 2020, 2023, 2023, 2024, 2025 ===
\node[msPurple] at (16.0, 0.7) {};
\node[labA] at (16.0, 0.9) {DDPM / GPT-3\\[-2pt]2020};

\node[msPurple] at (17.0, 0.7) {};
\node[labB] at (17.0, 0.5) {GraphCast\\[-2pt]2023};

\node[msPurple] at (17.8, 0.7) {};
\node[labA] at (17.8, 0.9) {DreamerV3\\[-2pt]2023};

\node[msPurple] at (18.7, 0.7) {};
\node[labB] at (18.7, 0.5) {Sora\\[-2pt]2024};

\node[msPurple] at (19.6, 0.7) {};
\node[labA] at (19.6, 0.9) {AlphaEvolve\\[-2pt]2025};

\end{tikzpicture}%
}
```
Historically, symbolic approaches to machine intelligence struggled to scale (see Section `\ref{sec:trends:history}`{=latex}), leading modern systems to adopt latent, implicit representations. Here scientific theories offer a telling contrast. Newton's laws, Maxwell's equations, and the Standard Model are instances of world models expressed in compact *symbolic* form, and arguably represent the most successful human instances of L3 systems: explicit, revisable, and composable. This contrast forces a question the field has largely avoided: is the endpoint of world modeling symbolic discovery, with neural latents as a scaffold, or are latent dynamics themselves the goal?

```{=latex}
\begin{figure*}[t]

% Unified graphical model for L1/L2/L3.
% Top block: agent (model M_t) in environment X, POMDP structure.
% Bottom block: revised agent (model M_{t+1}) in environment X'.
% Colored dashed boxes mark L1 (one-step), L2 (trajectory), L3 (revision).
% Legend on the right replaces inline labels to keep the diagram uncluttered.
\resizebox{\textwidth}{!}{%
\begin{tikzpicture}[
    x=2.6cm, y=1.6cm,
    env/.style={circle, draw=black, thick, dashed, minimum size=0.95cm, font=\small},
    learned/.style={circle, draw=black, double, double distance=1.2pt, thick, minimum size=0.95cm, font=\small},
    obs/.style={circle, draw=black, thick, fill=gray!25, minimum size=0.95cm, font=\small},
    act/.style={rectangle, draw=black, thick, minimum size=0.6cm, font=\small},
    arr/.style={->, thick, >=stealth},
    dyn/.style={->, thick, >=stealth, blue!70!black},
    em/.style={->, thick, >=stealth, dashed, black!60},
    lbl/.style={font=\scriptsize, inner sep=1pt},
    rowlbl/.style={font=\footnotesize, anchor=east},
    l1box/.style={draw=blue!60!black, dashed, thick, rounded corners=3pt, inner sep=5pt, fill=blue!8},
    l2box/.style={draw=green!50!black, dashed, thick, rounded corners=3pt, inner sep=9pt, fill=green!8},
    l3box/.style={draw=red!65!black, dashed, thick, rounded corners=5pt, inner sep=12pt, fill=red!6},
]

% ============================================================
% TOP BLOCK: Env X + Model M_t
% ============================================================

% Row 1: env states (dashed)
\node[env] (x0) at (0, 3)   {$x_{0}$};
\node[env] (x1) at (1, 3)   {$x_{1}$};
\node[env] (x2) at (2, 3)   {$x_{2}$};
\node       (xd) at (3, 3)  {$\cdots$};
\node[env] (xH) at (4, 3)   {$x_{H}$};

% Row 2: agent latents (double) and actions (squares)
\node[learned] (z0) at (0, 2)   {$z_{0}$};
\node[act]     (a0) at (0.5, 2) {$a_{0}$};
\node[learned] (z1) at (1, 2)   {$z_{1}$};
\node[act]     (a1) at (1.5, 2) {$a_{1}$};
\node[learned] (z2) at (2, 2)   {$z_{2}$};
\node           (zd) at (3, 2)  {$\cdots$};
\node[learned] (zH) at (4, 2)   {$z_{H}$};

% Row 3: observations (gray)
\node[obs] (o0) at (0, 1) {$o_{0}$};
\node[obs] (o1) at (1, 1) {$o_{1}$};
\node[obs] (o2) at (2, 1) {$o_{2}$};
\node       (od) at (3, 1){$\cdots$};
\node[obs] (oH) at (4, 1) {$o_{H}$};

% Env transitions
\draw[em] (x0) -- (x1) node[lbl, midway, above] {$T$};
\draw[em] (x1) -- (x2) node[lbl, midway, above] {$T$};
\draw[em] (x2) -- (xd);
\draw[em] (xd) -- (xH);

% Emission x -> o (bent, to the right to stay clear of the learning graph)
\draw[em] (x0) to[bend left=40] (o0);
\draw[em] (x1) to[bend left=40] (o1);
\draw[em] (x2) to[bend left=40] (o2);
\draw[em] (xH) to[bend left=40] (oH);

% Inference o -> z
\draw[dyn] (o0) -- (z0) node[lbl, midway, right, text=blue!70!black] {$q_\phi$};
\draw[dyn] (o1) -- (z1);
\draw[dyn] (o2) -- (z2);
\draw[dyn] (oH) -- (zH);

% Dynamics z -> a -> z
\draw[dyn] (z0) -- (a0);
\draw[dyn] (a0) -- (z1) node[lbl, midway, above, text=blue!70!black] {$p_\theta$};
\draw[dyn] (z1) -- (a1);
\draw[dyn] (a1) -- (z2);
\draw[dyn] (z2) -- (zd);
\draw[dyn] (zd) -- (zH);

% a -> next env state
\draw[em] (a0) -- (x1);
\draw[em] (a1) -- (x2);

% Row labels (aligned to a common column at x=-0.55)
\node[rowlbl, gray!80]  at (-0.55, 3) {Env $\mathcal{E}\!\sim\!\mathcal{X}$};
\node[rowlbl, blue!70!black]  at (-0.55, 2) {Model $\mathcal{M}_{t}$};
\node[rowlbl, gray!80]  at (-0.55, 1) {Observation};

% L1 / L2 / L3 highlight boxes are drawn together on the background layer
% AFTER all nodes/arrows are placed (see end of this tikzpicture).

% ============================================================
% REFLECT arrow: M_t -> M_{t+1} (short; asymmetric gap closed)
% ============================================================
\draw[arr, red!65!black, line width=1.3pt]
  (2, 0.4) -- (2, -0.2)
  node[midway, right=4pt, font=\small, text=red!65!black]
  {\textbf{reflect} $d_t\!:\!\mathcal{M}_t\!\to\!\mathcal{M}_{t+1}$};

% ============================================================
% BOTTOM BLOCK: Env X' + Model M_{t+1}
% ============================================================
\node[env] (xp0) at (0, -0.8)  {$x'_{0}$};
\node[env] (xp1) at (1, -0.8)  {$x'_{1}$};
\node[env] (xp2) at (2, -0.8)  {$x'_{2}$};
\node       (xpd) at (3, -0.8) {$\cdots$};
\node[env] (xpH) at (4, -0.8)  {$x'_{H}$};

\node[learned] (zp0) at (0, -1.8)   {$z'_{0}$};
\node[act]     (ap0) at (0.5, -1.8) {$a'_{0}$};
\node[learned] (zp1) at (1, -1.8)   {$z'_{1}$};
\node[act]     (ap1) at (1.5, -1.8) {$a'_{1}$};
\node[learned] (zp2) at (2, -1.8)   {$z'_{2}$};
\node           (zpd) at (3, -1.8)  {$\cdots$};
\node[learned] (zpH) at (4, -1.8)   {$z'_{H}$};

\node[obs] (op0) at (0, -2.8) {$o'_{0}$};
\node[obs] (op1) at (1, -2.8) {$o'_{1}$};
\node[obs] (op2) at (2, -2.8) {$o'_{2}$};
\node       (opd) at (3, -2.8){$\cdots$};
\node[obs] (opH) at (4, -2.8) {$o'_{H}$};

\draw[em] (xp0) -- (xp1) node[lbl, midway, above] {$T'$};
\draw[em] (xp1) -- (xp2) node[lbl, midway, above] {$T'$};
\draw[em] (xp2) -- (xpd);
\draw[em] (xpd) -- (xpH);
\draw[em] (xp0) to[bend left=40] (op0);
\draw[em] (xp1) to[bend left=40] (op1);
\draw[em] (xp2) to[bend left=40] (op2);
\draw[em] (xpH) to[bend left=40] (opH);
\draw[dyn] (op0) -- (zp0);
\draw[dyn] (op1) -- (zp1);
\draw[dyn] (op2) -- (zp2);
\draw[dyn] (opH) -- (zpH);
\draw[dyn] (zp0) -- (ap0);
\draw[dyn] (ap0) -- (zp1) node[lbl, midway, above, text=blue!70!black] {$p_{\theta'}$};
\draw[dyn] (zp1) -- (ap1);
\draw[dyn] (ap1) -- (zp2);
\draw[dyn] (zp2) -- (zpd);
\draw[dyn] (zpd) -- (zpH);
\draw[em] (ap0) -- (xp1);
\draw[em] (ap1) -- (xp2);

% Row labels bottom
\node[rowlbl, gray!80]  at (-0.55, -0.8) {Env $\mathcal{E}'\!\sim\!\mathcal{X}'$};
\node[rowlbl, blue!70!black]  at (-0.55, -1.8) {Model $\mathcal{M}_{t+1}$};
\node[rowlbl, gray!80]  at (-0.55, -2.8) {Observation};

% ---- Highlight boxes on background layer, L3 first (outermost / back),
%      then L2, then L1 (innermost / front of the backdrop). ----
\begin{scope}[on background layer]
\node[l3box, fit=(x0)(xH)(o0)(oH)(xp0)(xpH)(op0)(opH)] (L3) {};
\node[l2box, fit=(z0)(a0)(z1)(a1)(z2)(zd)(zH)] (L2) {};
\node[l1box, fit=(z0)(a0)(z1)] (L1) {};
\end{scope}

% ============================================================
% LEGEND (horizontal, below the L3 box -- three colored swatches + labels)
% ============================================================
\node[l1box, minimum width=0.6cm, minimum height=0.25cm, inner sep=0pt] (legL1) at (0.2, -3.5) {};
\node[anchor=base west, font=\small\bfseries, text=blue!60!black]  at (0.5, -3.55) {L1 Predictor};

\node[l2box, minimum width=0.6cm, minimum height=0.25cm, inner sep=0pt] (legL2) at (1.6, -3.5) {};
\node[anchor=base west, font=\small\bfseries, text=green!50!black] at (1.9, -3.55) {L2 Simulator};

\node[l3box, minimum width=0.6cm, minimum height=0.25cm, inner sep=0pt] (legL3) at (3.0, -3.5) {};
\node[anchor=base west, font=\small\bfseries, text=red!65!black]   at (3.3, -3.55) {L3 Evolver};

\end{tikzpicture}%
}

\caption{\textbf{Unified POMDP graphical model of L1-L3.} Dashed circles denote hidden environment states $x$; double circles denote learned latent states $z$; shaded circles denote observations $o$; squares denote actions $a$. Blue solid arrows denote the learned model (inference $q_\phi$ and dynamics $p_\theta$); dashed gray arrows denote the environment transition $T$ and observation emission. The top block shows the agent's POMDP under the current environment $\mathcal{E}\!\sim\!\mathcal{X}$ with model $\mathcal{M}_t$; the bottom block shows the same structure under a revised environment $\mathcal{E}'\!\sim\!\mathcal{X}'$ with model $\mathcal{M}_{t+1}$, obtained via the red \textbf{reflect} arrow. Colored dashed boxes mark each level's scope: L1 covers a single-step latent transition $p_\theta(z_t\mid z_{t-1}, a_{t-1})$; L2 covers the full trajectory rollout $\hat{p}(\tau\mid z_0, a_{1:H}, c)$ under a fixed model; L3 covers evidence-driven model revision $\mathcal{M}_t\to\mathcal{M}_{t+1}$, which corresponds to moving from $\mathcal{X}$ to a revised environment $\mathcal{X}'$ when the current model systematically fails.}
\label{fig:l1l2l3_graphical}
\end{figure*}
```
In scientific discovery, model updates arise at multiple scales: small anomalies trigger local modifications, while persistent discrepancies such as the "two dark clouds" [@kelvin1901nineteenth] in late 19th-century physics expose epistemic gaps that force revisions to a theory's *invariance structure*. The shift from Newtonian to relativistic mechanics, for instance, replaced Galilean invariance with Lorentz invariance. Modern ML systems also encode invariances, such as translation equivariance in convolutions and shape bias in attention-based models [@geirhos2018imagenet], but do so *implicitly*, through architecture and training, rather than as explicitly modifiable structures. This suits L1 prediction and L2 simulation under a fixed model, but at L3 (where the task is to revise the model structure itself) it becomes a liability. Symbolic representations, by contrast, expose governing principles as first-class objects that can be directly inspected and modified.

We therefore take representation to be a foundational question about what a world model *is*, not a choice among interchangeable designs. Latent dynamics are indispensable as a scaffold for L1 and L2, but the endpoint of L3, namely genuine revision of governing laws, requires a symbolic substrate. On this view, L1$\rightarrow$L2$\rightarrow$L3 is a progression not only in rollout depth but also in how laws are discovered, composed, and revised. Practical instantiations across regimes are surveyed in Section `\ref{sec:implementation}`{=latex}. Section `\ref{subsec:formal_defs}`{=latex} introduces a foundational formalism that is instantiation-agnostic.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{2mm}
```
::: {#tab:notation_summary}
  **Symbol**                                                            **Definition**
  --------------------------------------------------------------------- -----------------------------------------------------------------------------------------------
  *Environment*
  $\mathcal{E} = (\mathcal{X}, \mathcal{A}, \Omega, T, O, R, \gamma)$   POMDP environment tuple
  $x_t$                                                                 Hidden environment state at time $t$
  $o_t$                                                                 Observation at time $t$ (pixels, tokens, audio, etc.)
  $a_t$                                                                 Action at time $t$
  $T(x_{t+1}\mid x_t, a_t)$                                             Environment transition kernel
  $O(o_t \mid x_t)$                                                     Environment observation (emission) model
  $R,\;\gamma$                                                          Reward function and discount factor
  *Learned World-Model Components*
  $z_t$                                                                 Learned latent / internal state
  $q_\phi(z_t \mid o_{\le t}, a_{\le t-1})$                             State inference (encoder / filter); parameters $\phi$
  $p_\theta(z_t \mid z_{t-1}, a_t)$                                     Forward dynamics (one-step latent transition); parameters $\theta$
  $p_\psi(o_t \mid z_t)$                                                Observation decoder; parameters $\psi$
  $\pi_\eta(a_t \mid z_{t-1}, z_t)$                                     Inverse dynamics model; parameters $\eta$
  $\hat p(\cdot)$                                                       Trajectory-level (composed) distribution; hat marks approximate object
  *Trajectories and Planning*
  $a_{1:H}=(a_1,\ldots,a_H)$                                            Action sequence of horizon length $H$
  $\tau=(z_1,\dots,z_H)$                                                Future latent segment (anchored at $z_0$)
  $\hat p(\tau \mid z_0, a_{1:H}, c)$                                   L2 rollout query: trajectory distribution conditioned on anchor, actions, and constraints $c$
  $b_t;\;\mathrm{Bel}(b_t, a_t, o_{t+1})$                               Classical belief state and Bayesian belief update
  $\pi$                                                                 Policy (consumes world-model queries; not part of the world-model factorization)
  *L3 Model Revision*
  $\mathcal{M}_t$                                                       World-modeling stack at revision step $t$
  $d_t$                                                                 Deployment evidence (trajectories, errors, tests)
  $\mathcal{H}$                                                         Hypothesis space for model revision

  : **Notation summary used in this paper.**
:::

Notations {#subsec:notation_foundations}
---------

The preceding section proposed three capability stages from epistemological intuition. We now fix a unified symbol system, and Section `\ref{subsec:formal_defs}`{=latex} uses it to give each stage a precise definition. To cover model-based RL, predictive representation learning, video/world simulation, and generative modeling, we ground the notation in a Partially Observable Markov Decision Process (POMDP) [@kaelbling1998pomdp; @puterman1994mdp]. Figure `\ref{fig:l1l2l3_graphical}`{=latex} places this POMDP structure at the heart of the three-level taxonomy: each capability stage is visualized as a highlighted scope on the same graphical model. The environment is denoted by the tuple $$\mathcal{E} = (\mathcal{X}, \mathcal{A}, \Omega, T, O, R, \gamma),$$ where $\mathcal{X}$ is the (unobserved) state space, $\mathcal{A}$ the action space, and $\Omega$ the observation space (pixels, tokens, audio, etc.). Transitions and observations follow $$x_{t+1} \sim T(x_{t+1}\mid x_t, a_t),
\qquad
o_t \sim O(o_t \mid x_t).$$ Under partial observability, agents maintain a belief $b_t$ or a learned latent state $z_t$. Classical belief updates are written $b_{t+1} = \mathrm{Bel}(b_t, a_t, o_{t+1})$; we reserve the symbol $\tau$ for latent trajectories below. Learned systems infer latents from history: $$z_t = f_\phi(o_{\le t}, a_{\le t-1})
\quad\text{or}\quad
q_\phi(z_t \mid o_{\le t}, a_{\le t-1}).$$ Throughout, symbols are used as follows:

-   $T, O$: environment transition and observation mechanisms.

-   $q_\phi(\cdot)$: inference (history $\rightarrow$ latent).

-   $p_\theta(\cdot)$: learned local predictive or generative factors (one-step dynamics, decoders, etc.), with parameters $\theta$ (and analogously $\phi,\psi$ for inference and rendering).

-   $\hat p(\cdot)$: trajectory-level (or otherwise composed) distributions; the hat marks an explicit approximate object, e.g. the rollout marginal induced by repeated application of $p_\theta$.

-   $\pi,\,R,\,\gamma$: planner / policy, reward, and discount. These consume world-model queries but are not part of the world-model factorization $(q_\phi,p_\theta,p_\psi)$; the conceptual separation is discussed in Section `\ref{subsubsec:wm_vs_planner}`{=latex}.

**Convention:** $\hat p$ is reserved for composed objects such as $\hat p(\tau\mid z_0,a_{1:H},c)$; ordinary one-step dynamics are always written $p_\theta(z_t\mid z_{t-1},a_t)$. Table `\ref{tab:notation_summary}`{=latex} provides a concise reference for the symbols used in this paper.
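As a concrete reading of the classical update $b_{t+1} = \mathrm{Bel}(b_t, a_t, o_{t+1})$, the following is a minimal sketch for a finite POMDP, assuming tabular $T$ and $O$. The array layout, the function name `belief_update`, and all numbers are illustrative assumptions, not part of the paper's formalism.

```python
import numpy as np

def belief_update(b, a, o_next, T, O):
    """Discrete Bayes filter: b'(x') is proportional to O[x', o'] * sum_x T[a, x, x'] * b(x)."""
    predicted = b @ T[a]                 # predict: push the belief through T(. | x, a)
    unnorm = O[:, o_next] * predicted    # correct: weight by the likelihood O(o' | x')
    total = unnorm.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return unnorm / total

# Hypothetical 2-state, 1-action, 2-observation POMDP (numbers are illustrative only).
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])             # T[a, x, x'] = T(x' | x, a)
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])               # O[x, o] = O(o | x)
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, o_next=1, T=T, O=O)
```

A learned $q_\phi$ plays the same role as `belief_update` when $T$ and $O$ are unknown: it maps history to a state summary without access to the tabular kernels.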

$a_{1:H}=(a_1,\ldots,a_H)$ denotes an action sequence of length $H$ applied starting immediately after an anchor state $z_0$. The future segment is $$\tau \;=\; (z_1,z_2,\ldots,z_H),$$ so that $\hat p(\tau \mid z_0, a_{1:H}, c)$ matches the L2 formalism in Section `\ref{sec:l2}`{=latex}. From an arbitrary time index $t$, the same convention applies after a trivial shift: anchor at $z_t$, condition on $a_{t+1:t+H}$.
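One natural reading of the composed object, stated here as an assumption rather than as the formal definition (which is deferred to Section `\ref{sec:l2}`{=latex}), is the rollout distribution obtained by chaining the one-step dynamics while ignoring the constraint $c$: $$\hat p(\tau \mid z_0, a_{1:H}) \;=\; \prod_{t=1}^{H} p_\theta(z_t \mid z_{t-1}, a_t).$$ For $H=2$ this reads $$\hat p(z_1, z_2 \mid z_0, a_1, a_2) \;=\; p_\theta(z_1 \mid z_0, a_1)\, p_\theta(z_2 \mid z_1, a_2),$$ which makes explicit why the hat marks an approximate object: any error in $p_\theta$ enters once per factor and can compound with $H$.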

Definitions of Capabilities {#subsec:formal_defs}
---------------------------

With the symbol system established in Section `\ref{subsec:notation_foundations}`{=latex}, we now give each capability stage a precise definition with testable boundary conditions.

```{=latex}
\begin{definition}[L1 Predictor]An L1 world model provides local predictive operators that factorize into up to four components:
\begin{align}
\text{Inference / filtering:} \quad & q_\phi(z_t \mid o_{\le t}, a_{\le t-1}), \label{eq:l1_inf} \\
\text{Forward dynamics:} \quad & p_\theta(z_t \mid z_{t-1}, a_t)
\quad \text{or, without actions, } p_\theta(z_t\mid z_{t-1}), \label{eq:l1_dyn} \\
\text{Observation decoder:} \quad & p_\psi(o_t \mid z_t), \label{eq:l1_dec} \\
\text{Inverse dynamics:} \quad & \pi_\eta(a_t \mid z_{t-1}, z_t). \label{eq:l1_inv}
\end{align}
\end{definition}
```
These operators target **one-step** (or short-horizon) accuracy under the training distribution; no guarantee is made about the coherence of multi-step composition. Section `\ref{sec:l1}`{=latex} presents representative methods in detail.
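To make the four operators concrete, the sketch below instantiates them for a toy linear model. The linear form, the matrices `A`, `B`, `C`, and the function names are assumptions chosen for illustration; they stand in for learned networks and do not correspond to any specific method surveyed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy parameters: 2-D latent, 1-D action, 3-D observation.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])           # latent transition matrix (part of theta)
B = np.array([[0.0],
              [0.1]])                # action effect (part of theta)
C = rng.normal(size=(3, 2))          # decoder matrix (psi)

def infer(o):
    """State inference (stand-in for q_phi): least-squares inversion of the decoder.
    Uses only the latest observation; a real filter would also use history."""
    return np.linalg.lstsq(C, o, rcond=None)[0]

def dynamics(z_prev, a, noise_std=0.0):
    """Forward dynamics: sample z_t ~ N(A z_{t-1} + B a_t, noise_std^2 I)."""
    mean = A @ z_prev + B @ a
    return mean + noise_std * rng.normal(size=mean.shape)

def decode(z):
    """Observation decoder: mean of p_psi(o_t | z_t)."""
    return C @ z

def inverse_dynamics(z_prev, z):
    """Inverse dynamics: least-squares action explaining z_{t-1} -> z_t."""
    return np.linalg.lstsq(B, z - A @ z_prev, rcond=None)[0]

z0 = np.array([0.0, 1.0])
a1 = np.array([0.5])
z1 = dynamics(z0, a1)                # one-step prediction
o1 = decode(z1)                      # predicted observation
z1_hat = infer(o1)                   # round trip through the decoder
a1_hat = inverse_dynamics(z0, z1)    # recovered action
```

Each function is evaluated on a single step only, which mirrors the L1 scope: nothing in this sketch checks what happens when `dynamics` is applied repeatedly.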

```{=latex}
\begin{definition}[L2 Simulator]An L2 world model extends L1 from local operators to \textbf{decision-usable multi-step simulation}.
It must support trajectory-level queries of the form
\[
\hat p(\tau \mid z_0, a_{1:H}, c), \qquad \tau=(z_1,\ldots,z_H),
\]
subject to three boundary conditions that collectively mark L1~$\to$~L2:
\begin{enumerate}[leftmargin=*]
  \item \textbf{Long-horizon coherence:} rollouts remain usable over \(H\) steps rather than degrading immediately via compounding error.
  \item \textbf{Intervention sensitivity:} counterfactual edits (action or premise changes) induce stable and directionally meaningful trajectory changes.
  \item \textbf{Constraint consistency:} generated futures respect the governing laws of the target regime (the physical, digital, social, or scientific world).
\end{enumerate}
\end{definition}
```
The key difference from L1 is not one-step quality but **rollout fidelity under composition**.

The three L2 boundary conditions are complementary rather than redundant. Long-horizon coherence concerns whether rollout quality survives composition over time; intervention sensitivity concerns whether changes in actions or premises induce stable and directionally meaningful changes in the predicted future; and constraint consistency concerns whether the resulting trajectories remain valid under the governing laws of the target regime. None of these implies the others in general: a model may generate coherent but action-insensitive rollouts, or action-sensitive rollouts that still violate domain constraints. In practice they can also trade off against one another, for example when aggressive constraint enforcement stabilizes trajectories at the cost of reduced responsiveness to interventions.
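The three conditions can be probed separately on a toy system. The sketch below assumes a one-dimensional point mass as the reference dynamics and a hypothetical learned model with a small systematic bias; the function names, the `bias` parameter, and the speed limit `v_max` are illustrative assumptions, not an evaluation protocol proposed by this paper.

```python
import numpy as np

DT = 0.1

def true_step(z, a):
    """Reference dynamics: point mass, z = (position, velocity), action = acceleration."""
    pos, vel = z
    return np.array([pos + DT * vel, vel + DT * a])

def model_step(z, a, bias=0.01):
    """Hypothetical learned one-step model with a small systematic velocity error."""
    pos, vel = z
    return np.array([pos + DT * vel, vel + DT * a + bias * DT])

def rollout(step, z0, actions):
    """Compose a one-step operator into a trajectory tau = (z_1, ..., z_H)."""
    traj, z = [], np.asarray(z0, dtype=float)
    for a in actions:
        z = step(z, a)
        traj.append(z)
    return np.stack(traj)

def coherence_error(z0, actions):
    """Per-step position error of the model rollout against the reference rollout."""
    diff = rollout(model_step, z0, actions) - rollout(true_step, z0, actions)
    return np.abs(diff[:, 0])

def intervention_effect(z0, actions, alt_actions):
    """Change in final position when the action sequence is counterfactually edited."""
    base = rollout(model_step, z0, actions)[-1, 0]
    edited = rollout(model_step, z0, alt_actions)[-1, 0]
    return edited - base

def satisfies_constraint(traj, v_max):
    """Constraint c: speed never exceeds v_max along the rollout."""
    return bool(np.all(np.abs(traj[:, 1]) <= v_max))

z0 = np.array([0.0, 0.0])
H = 50
push = np.ones(H)
err = coherence_error(z0, push)                       # grows with the horizon
effect = intervention_effect(z0, push, -push)         # reversing the push moves the mass backward
ok = satisfies_constraint(rollout(model_step, z0, push), v_max=10.0)
```

In this toy case the one-step error is tiny, yet the position error grows roughly quadratically with the horizon, which is the compounding behaviour that the coherence condition is meant to rule out.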

A fourth capability, **closed-loop use** (supporting planning, acting, and self-improvement through interaction with the modeled environment), further separates world modeling from generic prediction but is orthogonal to L1/L2/L3: a weather emulator can be an L2 world model with no embedded planner (see Appendix `\ref{app:boundaries}`{=latex} for extended discussion). The term "closed-loop" has two senses that must not be conflated: using a world model inside a control or planning loop is an orthogonal deployment property, whereas revising the world-model stack itself from deployment evidence is the defining hallmark of L3.

```{=latex}
\begin{definition}[L3 Evolver]An L3 world model extends L2 from rollout over a fixed scaffold to \textbf{evidence-driven model revision}.
In addition to simulation queries, an L3 system maintains an explicit update loop over model assets:
\[
(\mathcal{M}_t,\; d_t) \;\xrightarrow{\;\text{diagnose\,+\,distill\,+\,validate}\;}\; \mathcal{M}_{t+1},
\]
where \(\mathcal{M}_t\) is the current world-modeling stack at revision step \(t\) and \(d_t\) is new deployment evidence (trajectories, errors, counterexamples, tests).
Three boundary conditions mark L2~$\to$~L3:
\begin{enumerate}[leftmargin=*]
  \item \textbf{Evidence-grounded diagnosis:} failures are attributed to actionable causes using replayable evidence.
  \item \textbf{Persistent asset update:} fixes are promoted as reusable assets (skills, rules, parsers, tests), not only ephemeral in-context patches.
  \item \textbf{Governed validation:} updates pass regression and robustness gates (including rollback and canary policies) before default enablement.
\end{enumerate}
\end{definition}
```
The key difference from L2 is that the **model itself becomes an object of revision**, not merely a fixed scaffold to be queried [@lu2024aiscientist; @boiko2023autonomous]. Recapping the scopes in Figure `\ref{fig:l1l2l3_graphical}`{=latex}: **L1 (Predictor)** is a single-step transition $p_\theta(z_t\mid z_{t-1}, a_{t-1})$ with its supporting inference and decoding operators, acting locally on one edge of the latent chain; **L2 (Simulator)** composes those local operators into a trajectory $\hat p(\tau\mid z_0, a_{1:H}, c)$ under a fixed model $\mathcal{M}_t$ and governing-law constraint $c$; and **L3 (Evolver)** revises the model stack $\mathcal{M}_t\to\mathcal{M}_{t+1}$ from distilled evidence $d_t$, yielding a different latent graph (the bottom block of the figure) whose effective environment $\mathcal{E}'\!\sim\!\mathcal{X}'$ may differ from the original, whether because the world itself has shifted, because the agent has uncovered previously unmodeled structure, or because the hypothesis space has been expanded. The three levels form a containment hierarchy: L2 invokes L1 at each step, and L3 invokes L2 each time it probes the world for evidence before committing to a model update.
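The update loop $(\mathcal{M}_t, d_t) \to \mathcal{M}_{t+1}$ can be sketched for a deliberately tiny model. Everything below is a hypothetical illustration: the scalar `Model`, the function names, and the data are assumptions, and the canary and rollback policies of the definition are reduced to a single accept-or-keep gate.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Model:
    """Hypothetical model stack M_t: a scalar transition z' = z + gain * a."""
    gain: float
    version: int = 0

    def predict(self, z, a):
        return z + self.gain * a

def error(model, data):
    """Mean absolute one-step error on (z, a, z_next) triples."""
    return mean(abs(model.predict(z, a) - z_next) for z, a, z_next in data)

def diagnose(model, evidence, tol):
    """Evidence-grounded diagnosis: keep the replayable transitions the model gets wrong."""
    return [(z, a, zn) for z, a, zn in evidence if abs(model.predict(z, a) - zn) > tol]

def distill(model, failures):
    """Distill failures into a candidate asset: re-estimate the gain from failing cases."""
    gains = [(zn - z) / a for z, a, zn in failures if a != 0]
    return Model(gain=mean(gains), version=model.version + 1) if gains else model

def validate(candidate, incumbent, regression_suite):
    """Governed validation: promote only if the regression suite does not get worse."""
    return error(candidate, regression_suite) <= error(incumbent, regression_suite)

def revise(model, evidence, regression_suite, tol=0.05):
    """One revision step (M_t, d_t) -> M_{t+1}; keeps M_t if the gate fails."""
    failures = diagnose(model, evidence, tol)
    if not failures:
        return model
    candidate = distill(model, failures)
    return candidate if validate(candidate, model, regression_suite) else model

# Illustrative data: the environment's true gain has shifted from 1.0 to 2.0.
m0 = Model(gain=1.0)
evidence = [(0.0, 1.0, 2.0), (1.0, 0.5, 2.0), (2.0, -1.0, 0.0)]
suite = [(0.0, 2.0, 4.0), (3.0, 1.0, 5.0)]
m1 = revise(m0, evidence, suite)
```

The promoted `m1` is a persistent object with its own version number, which is the distinction the second boundary condition draws against ephemeral in-context patches. A real L3 system would revise structure, not just a scalar parameter.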

#### Agent-centered view: state, action, and task.

The formal components above describe an agent whose decisions are determined by three elements: the state it believes the world to be in, the action it can execute, and the task (or constraint $c$) it must satisfy. This triple, not a flat observation-to-action mapping, defines the interface between world model and planner. Building a useful $z_t$ involves two orthogonal challenges that structure Section `\ref{sec:l1}`{=latex}: (i) *spatial representation*, compressing a high-dimensional observation $o_t$ into a compact latent that retains decision-relevant structure (geometry, semantics, affordances); and (ii) *temporal fusion*, integrating history $(o_{\le t}, a_{\le t-1})$ so that $z_t$ approximates a Markov belief even in partially observable settings. Actions are not flat variables: they can emerge from representation learning rather than being pre-defined, with the core dynamics captured by the latent representation and everything else serving as a decoder [@lecun2022path]. Real agent behavior decomposes across temporal scales and abstraction levels, including low-level motor primitives, mid-level skills, and high-level task plans. The world model must predict transitions at the granularity that matches the planner's query horizon. This action hierarchy interacts directly with the L1$\to$L2 boundary: local dynamics suffice for primitive-level prediction [@sun2025learningprimitiveembodiedworld], but skill- and task-level rollouts require the multi-step coherence that defines L2. At the L3 level, the agent must not only predict transitions across temporal scales but also decide when its own transition model is inadequate and initiate model revision. L3 treats the world-modeling stack itself as an object of action. Diagnostic probes, architecture modifications, and regression tests become "meta-actions" that operate on the model rather than on the environment itself, reshaping how the system learns rather than merely how it acts.

Scope of Laws {#subsec:scope_regimes}
-------------

As introduced in Section `\ref{sec:scope}`{=latex}, we organize the survey along two orthogonal axes: capability level (L1/L2/L3) and governing-law regime. This subsection elaborates the four regimes and the constraints each imposes on the learned transition function. We distinguish **Laws of the Physical World** (governing agents that perceive and act in physical environments), **Laws of the Digital World** (governing deterministic program semantics: code, APIs, and state machines), **Laws of the Social World** (governing the dynamics of minds and institutions: beliefs, goals, and norms), and **Laws of the Scientific World** (governing systems that exist independently of human design, whose dynamics must be discovered from empirical observation). These four regimes are representative, not exhaustive. Real-world systems often operate under multiple regimes simultaneously. For example, autonomous driving involves both physical dynamics and social norms, while drug design couples natural mechanisms with digital simulation pipelines.

**Laws of the Physical World** constrain transitions through the physical dynamics that embodied agents must respect: contact mechanics, collision response, gravitational acceleration, friction, and kinematic feasibility. In robotic manipulation, autonomous driving, and interactive 3D simulation, the learned transition $p_\theta(z_t \mid z_{t-1}, a_t)$ must encode these physical interactions faithfully [@todorov2012mujoco; @hu2023gaia1; @wang2024drivedreamer]. This regime is distinguished by analytically characterizable governing equations: a physics engine or analytic model can verify whether a predicted transition is consistent with rigid-body constraints and Newtonian mechanics. Constraint violations appear as objects passing through each other, gravity reversing mid-rollout, or physically impossible deformations. Such failures are readily detectable because the ground-truth dynamics admit closed-form or high-accuracy numerical reference solutions.

**Laws of the Digital World** constrain transitions through deterministic program semantics, including API contracts, UI state machines, file-system logic, and network protocols. In web navigation, code generation, and software testing, the transition function $p_\theta(z_t \mid z_{t-1}, a_t)$ is largely *deterministic* but branches heavily through error codes, permission checks, and edge cases [@gu2024webdreamer; @yao2022webshop]. This regime is defined by transitions that are both *specifiable* and *verifiable*: the program can be executed and its output compared against the model's prediction. Constraint violations include calling an API that does not exist, ignoring returned error codes, and violating type constraints. Because the underlying system is a formal artifact, such errors are mechanically checkable.
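The claim that digital-regime errors are mechanically checkable can be illustrated with a minimal sketch: execute the real program and compare its result with the model's predicted transition. The key-value store, the status strings, and the function names below are hypothetical and chosen only for illustration.

```python
def execute(state, action):
    """Ground-truth program semantics for a hypothetical key-value store API."""
    op, key, *rest = action
    store = dict(state)                  # copy so the caller's state is not mutated
    if op == "put":
        store[key] = rest[0]
        return store, "ok"
    if op == "delete":
        if key not in store:
            return store, "error:not_found"
        del store[key]
        return store, "ok"
    return store, "error:unknown_op"

def verify(predicted, state, action):
    """Mechanically check a predicted (next_state, status) pair against real execution."""
    return predicted == execute(state, action)

state = {"a": 1}
good = ({"a": 1, "b": 2}, "ok")          # correct prediction for put b=2
bad = ({"a": 1}, "ok")                   # prediction that ignores the returned error code
good_ok = verify(good, state, ("put", "b", 2))
bad_ok = verify(bad, state, ("delete", "zzz"))
```

Here `bad` gets the next state right but the status wrong, so the check rejects it. This mirrors the "ignoring returned error codes" violation named above.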

**Laws of the Social World** constrain transitions through beliefs, goals, norms, social contracts, and institutional rules. In social simulation, dialogue systems, and multi-agent interaction, $p_\theta(z_t \mid z_{t-1}, a_t)$ maps joint actions and mental states to new mental states and social outcomes [@park2023generative; @zhou2025socialwm]. Two properties set this regime apart. Transitions are *reflexive*, meaning that agents' beliefs about the state actively change the state itself. They are also *normative*, governed not only by what will happen but by what should happen according to shared conventions. Constraint violations include breaking a promise without consequence, forgetting a prior commitment, and ignoring established social norms. Such failures undermine coherence because social outcomes depend on mutual expectation.

**Laws of the Scientific World** constrain transitions through latent causal mechanisms that must be *discovered* from empirical observation rather than specified *a priori*. In weather prediction, molecular dynamics, protein folding, and drug design, $p_\theta(z_t \mid z_{t-1}, a_t)$ encodes atmospheric dynamics, chemical kinetics, or biological processes whose exact functional forms are unknown or too complex to write analytically [@karniadakis2021pinn; @lam2023graphcast; @abramson2024alphafold3]. This regime differs in that the governing equations are *not available in closed form*: the world model must learn them from data and be validated against experimental measurement. Constraint violations include predicting physically impossible molecular configurations, violating empirically established conservation laws, and ignoring known causal dependencies. Detection typically requires comparison with laboratory or observational data rather than symbolic verification.

With these foundations in place, the following sections instantiate each capability level in turn: Section `\ref{sec:l1}`{=latex} surveys L1 methods, Section `\ref{sec:l2}`{=latex} addresses L2 simulation, and Section `\ref{sec:l3}`{=latex} examines L3 model revision. Appendix `\ref{app:boundaries}`{=latex} clarifies the distinctions between world modeling and generic prediction, world models and planners, and world modeling and the commonsense reasoning that agents rely on in unscripted settings.

L1 Predictor: Local Markov Prediction {#sec:l1}
=====================================

The hierarchy begins with **L1**, which concerns a world model's local predictive ability: the model must maintain a meaningful internal state and use *local* predictive mechanisms to anticipate the next state, including potential observations or actions. In the unified graphical model of Figure `\ref{fig:l1l2l3_graphical}`{=latex}, L1 is the scope of a single edge $z_{t-1}\to z_t$ conditioned on the intervening action (drawn as $a_{t-1}$ in the figure; written $a_t$ in Eq. `\eqref{eq:l1_dyn}`{=latex}). This section elaborates the operators that populate this one-step transition and examines how they are realized in contemporary world-model systems.

Definition {#subsec:l1_definition}
----------

L1 concerns the local predictive ability of a world model for an agent acting in an environment to accomplish a task or goal. More precisely, an agent is a system that, given observations, makes decisions and takes actions in order to satisfy an objective. In this paper, the role of an L1 world model is therefore not merely to predict the next signal, but to provide local predictive operators that support such decision-making at the granularity of one step (or a short fixed horizon). This epistemic stance aligns with Hume's constant conjunction: regularities are extracted from observed data without claiming causal necessity (Section `\ref{sec:philosophy_l1}`{=latex}).

The POMDP formulation that underlies L1 originates from the reinforcement learning literature, where an agent must select actions under partial observability to maximize cumulative reward [@kaelbling1998pomdp; @puterman1994mdp]. In this setting, the agent maintains an internal belief over hidden states and follows a policy $\pi(a_t \mid b_t)$ that maps beliefs to actions. This formulation constitutes the prototypical agent--environment loop [@sutton1991dyna]. For an agent that interacts with the environment to accomplish a task, the POMDP decomposes into four local operators: state inference, forward dynamics, observation decoding, and inverse dynamics. Together these describe the foundational learning problems for world models at the L1 level.

Following this formulation (Section `\ref{sec:preliminaries}`{=latex}), **L1** is characterized by **local predictive operators** acting on a learned internal state $z_t$ (resembling a belief state), with a **one-step** (or short fixed-horizon) transition operator as the central modeling object. In practical terms, $z_t$ is inferred from observations and actions and serves as a learned approximation to the hidden environment state and/or belief [@hafner2023dreamerv3; @schrittwieser2020muzero]. Learning such latent dynamics can be traced back to locally linear latent models for control [@watter2015e2c] and Gaussian-process dynamics [@deisenroth2011pilco], and has since been extended by deep learning architectures [@ha2018worldmodels; @hafner2019dreamer]. The term "Markov" in **L1** refers to the *Markov property of the learned internal state* $z_t$: $z_t$ is sufficient (or nearly sufficient) for predicting the next local step. It does not refer to direct observability of the environment state [@hafner2019recurrent; @hafner2023dreamerv3; @gelada2019deepmdp].

At the model level, L1 factorizes into four local operators over $z_t$ (Table `\ref{tab:l1_architecture}`{=latex}). The *core* operator is latent dynamics ($z_{t-1} \to z_t$); the others are common supporting operators:

-   **State inference** (observation $\to$ state, Eq. `\eqref{eq:l1_inf}`{=latex}): $z_t = f_\phi(o_{\le t}, a_{\le t-1})$ or $q_\phi(z_t \mid o_{\le t}, a_{\le t-1})$. The learned belief-like state summarizes relevant history for prediction [@hafner2019recurrent; @lesort2018srl].

-   **Forward dynamics** (state $\to$ next state; core L1 operator, Eq. `\eqref{eq:l1_dyn}`{=latex}): $z_t \sim p_\theta(z_t \mid z_{t-1}, a_t)$ (action-conditioned) or $z_t \sim p_\theta(z_t \mid z_{t-1})$ (action-free).

-   **Observation decoding** (state $\to$ observation, Eq. `\eqref{eq:l1_dec}`{=latex}): $p_\psi(o_t \mid z_t)$, mapping latent state back to observation space [@kingma2014vae; @rezende2014stochastic].

-   **Inverse dynamics** (Eq. `\eqref{eq:l1_inv}`{=latex}): $\pi_\eta(a_t \mid z_{t-1}, z_t)$, used as an auxiliary objective or for representation shaping [@pathak2017curiosity; @hafner2019dreamer].

```{=latex}
\renewcommand{\arraystretch}{1.3}
```
```{=latex}
\setlength{\tabcolsep}{1mm}
```
::: {#tab:l1_architecture}
  **Operator**           **Mapping**                **Formal Definition**                                                                 **Role**
  ---------------------- -------------------------- ------------------------------------------------------------------------------------- ------------------------------------------------
  State Inference        $o_t \to z_t$              $z_t = f_\phi(o_{\le t}, a_{\le t-1})$ or $q_\phi(z_t \mid o_{\le t}, a_{\le t-1})$   Compress observations into latent belief state
  Forward Dynamics       $z_{t-1} \to z_t$          $p_\theta(z_t \mid z_{t-1}, a_t)$                                                     Predict next latent state given action
  Observation Decoding   $z_t \to o_t$              $p_\psi(o_t \mid z_t)$                                                                Reconstruct observations as training signals
  Inverse Dynamics       $(z_{t-1}, z_t) \to a_t$   $\pi_\eta(a_t \mid z_{t-1}, z_t)$                                                     Infer actions; representation shaping

  : **L1 component factorization.** The four local operators form the building blocks of L1 world models. The core operator is forward dynamics; the others are supporting operators.
:::

```{=latex}
\begin{table*}[!t]\caption{\textbf{Representative L1 methods.} Columns indicate which local operators each method instantiates: state inference~(SI), forward dynamics~(FD), observation decoding~(OD), and inverse dynamics~(ID).}
\label{tab:l1_methods}

\setlength{\tabcolsep}{0.75mm}
% \scriptsize
\begin{tabular}{l|cc|ccccl}
\toprule
\textbf{Method} & \multicolumn{2}{c|}{\textbf{Links}} & \textbf{SI} & \textbf{FD} & \textbf{OD} & \textbf{ID} & \textbf{Architecture} \\
\midrule
\rowcolor{gray!15}
\textit{Representation Learning} \\
VAE~\citep{kingma2014vae} & \paperlink{https://arxiv.org/abs/1312.6114} & {--} & \cmark & \xmark & \cmark & \xmark & MLP encoder--decoder \\
$\beta$-VAE~\citep{higgins2017betavae} & \paperlink{https://openreview.net/forum?id=Sy2fzU9gl} & {--} & \cmark & \xmark & \cmark & \xmark & MLP encoder--decoder \\
VQ-VAE~\citep{oord2017vqvae} & \paperlink{https://arxiv.org/abs/1711.00937} & \githublink{https://github.com/MishaLaskin/vqvae} & \cmark & \xmark & \cmark & \xmark & CNN + discrete codebook \\
CPC~\citep{oord2018cpc} & \paperlink{https://arxiv.org/abs/1807.03748} & {--} & \cmark & \xmark & \xmark & \xmark & CNN + autoregressive \\
SimCLR~\citep{chen2020simclr} & \paperlink{https://arxiv.org/abs/2002.05709} & \githublink{https://github.com/google-research/simclr} & \cmark & \xmark & \xmark & \xmark & ResNet + projection head \\
MoCo~\citep{he2020moco} & \paperlink{https://arxiv.org/abs/1911.05722} & \githublink{https://github.com/facebookresearch/moco} & \cmark & \xmark & \xmark & \xmark & Momentum encoder \\
CURL~\citep{srinivas2020curl} & \paperlink{https://arxiv.org/abs/2004.04136} & \githublink{https://github.com/MishaLaskin/curl} & \cmark & \xmark & \xmark & \xmark & CNN + momentum encoder \\
SPR~\citep{schwarzer2021spr} & \paperlink{https://arxiv.org/abs/2007.05929} & \githublink{https://github.com/mila-iqia/spr} & \cmark & \xmark & \xmark & \xmark & CNN + prediction MLP \\
I-JEPA~\citep{assran2023ijepa} & \paperlink{https://arxiv.org/abs/2301.08243} & \githublink{https://github.com/facebookresearch/ijepa} & \cmark & \xmark & \xmark & \xmark & ViT + predictor \\
V-JEPA~\citep{bardes2024vjepa} & \paperlink{https://arxiv.org/abs/2404.08471} & \githublink{https://github.com/facebookresearch/jepa} & \cmark & \xmark & \xmark & \xmark & ViT + predictor \\
DINOv2~\citep{oquab2024dinov2} & \paperlink{https://arxiv.org/abs/2304.07193} & \githublink{https://github.com/facebookresearch/dinov2} & \cmark & \xmark & \xmark & \xmark & ViT self-distillation \\
\midrule
\rowcolor{gray!15}
\textit{Model-Based RL} \\
PILCO~\citep{deisenroth2011pilco} & \paperlink{https://dl.acm.org/doi/10.5555/3104482.3104541} & \githublink{https://github.com/UCL-SML/pilco-matlab} & \xmark & \cmark & \xmark & \xmark & Gaussian process \\
E2C~\citep{watter2015e2c} & \paperlink{https://arxiv.org/abs/1506.07365} & {--} & \cmark & \cmark & \cmark & \xmark & Locally linear latent \\
PETS~\citep{chua2018pets} & \paperlink{https://arxiv.org/abs/1805.12114} & \githublink{https://github.com/kchua/handful-of-trials} & \xmark & \cmark & \xmark & \xmark & Ensemble of NNs \\
World Models~\citep{ha2018worldmodels} & \paperlink{https://arxiv.org/abs/1803.10122} & \githublink{https://github.com/hardmaru/WorldModelsExperiments} & \cmark & \cmark & \cmark & \xmark & VAE + MDN-RNN \\
Dreamer~\citep{hafner2019dreamer} & \paperlink{https://arxiv.org/abs/1912.01603} & \githublink{https://github.com/danijar/dreamer} & \cmark & \cmark & \cmark & \xmark & RSSM (GRU + stoch.) \\
DreamerV2~\citep{hafner2020dreamerv2} & \paperlink{https://arxiv.org/abs/2010.02193} & \githublink{https://github.com/danijar/dreamerv2} & \cmark & \cmark & \cmark & \xmark & RSSM (discrete stoch.) \\
DreamerV3~\citep{hafner2023dreamerv3} & \paperlink{https://arxiv.org/abs/2301.04104} & \githublink{https://github.com/danijar/dreamerv3} & \cmark & \cmark & \cmark & \xmark & RSSM + symlog \\
MuZero~\citep{schrittwieser2020muzero} & \paperlink{https://arxiv.org/abs/1911.08265} & {--} & \cmark & \cmark & \xmark & \xmark & MLP dynamics + MCTS \\
EfficientZero~\citep{ye2021efficientzero} & \paperlink{https://arxiv.org/abs/2111.00210} & \githublink{https://github.com/YeWR/EfficientZero} & \cmark & \cmark & \xmark & \xmark & MuZero + self-sup.\ \\
TD-MPC2~\citep{hansen2024tdmpc2} & \paperlink{https://arxiv.org/abs/2310.16828} & \githublink{https://github.com/nicklashansen/tdmpc2} & \cmark & \cmark & \xmark & \xmark & MLP latent dynamics \\
DeepMDP~\citep{gelada2019deepmdp} & \paperlink{https://arxiv.org/abs/1906.02736} & {--} & \cmark & \cmark & \xmark & \xmark & Bellman-aligned latent \\
MBPO~\citep{janner2019mbpo} & \paperlink{https://arxiv.org/abs/1906.08253} & \githublink{https://github.com/jannerm/mbpo} & \xmark & \cmark & \xmark & \xmark & Ensemble of NNs \\
\midrule
\rowcolor{gray!15}
\textit{Token / Diffusion-Based} \\
IRIS~\citep{micheli2023iris} & \paperlink{https://arxiv.org/abs/2209.00588} & \githublink{https://github.com/eloialonso/iris} & \cmark & \cmark & \cmark & \xmark & VQ-VAE + Transformer \\
TransDreamer~\citep{chen2022transdreamer} & \paperlink{https://arxiv.org/abs/2202.09481} & \githublink{https://github.com/changchencc/TransDreamer} & \cmark & \cmark & \cmark & \xmark & Transformer-XL + stoch. \\
Latent Diffusion~\citep{rombach2022latentdiffusion} & \paperlink{https://arxiv.org/abs/2112.10752} & \githublink{https://github.com/CompVis/latent-diffusion} & \cmark & \xmark & \cmark & \xmark & Latent-space diffusion \\
STORM~\citep{zhang2023storm} & \paperlink{https://arxiv.org/abs/2310.09615} & \githublink{https://github.com/weipu-zhang/STORM} & \cmark & \cmark & \cmark & \xmark & Transformer + VAE \\
DIAMOND~\citep{alonso2024diamond} & \paperlink{https://arxiv.org/abs/2405.12399} & \githublink{https://github.com/eloialonso/diamond} & \xmark & \cmark & \xmark & \xmark & Pixel-space diffusion \\
Delta-IRIS~\citep{micheli2024deltairis} & \paperlink{https://arxiv.org/abs/2406.19320} & \githublink{https://github.com/vmicheli/delta-iris} & \cmark & \cmark & \cmark & \xmark & VQ-VAE + delta coding \\
\bottomrule
\end{tabular}%
\end{table*}
```
Approaches {#subsec:l1_methods}
----------

We organize representative **L1** methods by the four local operators introduced above: state inference (deriving $z_t$ from observations and history), forward dynamics (the core transition model), observation decoding (mapping $z_t$ to $o_t$), and inverse dynamics (inferring actions from successive states) [@ding2024survey_wm; @moerland2023mbrl]. Table `\ref{tab:l1_methods}`{=latex} summarizes representative methods, the operators each one instantiates, and its architecture. We devote the most space to forward dynamics because it is the operator that most directly determines whether an L1 system can later be elevated into an L2 simulator; the other components are still essential, but their role is primarily to make the latent state usable for that transition.

### State Inference

State inference condenses high-dimensional observations into a compact latent representation $z_t$ that preserves crucial decision-making information, and it integrates temporal context to ensure that $z_t$ approximates a Markovian belief in partially observable scenarios [@lesort2018srl].

Contrastive Predictive Coding (CPC; @oord2018cpc) trains an encoder to maximize a lower bound on the mutual information between present and future embeddings via the InfoNCE loss, which contrasts temporally adjacent positive pairs against negatives drawn from the same batch. SimCLR [@chen2020simclr] and MoCo [@he2020moco] established general-purpose contrastive frameworks through augmentation-based positive pairs and momentum-updated encoders, respectively, providing pretrained visual backbones that downstream world models build upon. However, general visual representations do not guarantee that $z_t$ preserves control-relevant information. CURL [@srinivas2020curl] addressed this by training a contrastive objective jointly with the RL agent, using two augmented (randomly cropped) views of the same stacked-frame observation as positive pairs and a momentum encoder for the keys, and reported sample efficiency competitive with model-based methods on Atari and continuous control. Self-Predictive Representations (SPR; @schwarzer2021spr) trains an encoder to forecast its own future representations, which injects temporal structure and decision-relevant dynamics into the representation. Both methods show strong sample efficiency on Atari 100k, suggesting that world-model representations benefit from being tailored to decision-making objectives.
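
To make the contrastive objective concrete, the following is a minimal pure-Python sketch of an InfoNCE-style loss with in-batch negatives. The function name `info_nce`, the temperature value, and the toy embeddings are our own illustrative assumptions and are not taken from the cited implementations.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss. Row i of `positives` is the positive for anchor i;
    every other row in the batch acts as a negative."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine-similarity logits between anchor i and every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax at the positive index
    return total / len(a)

# Toy embeddings (hypothetical): correctly paired vs. deliberately mispaired.
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]
loss_aligned = info_nce(aligned, aligned)
loss_shuffled = info_nce(aligned, shuffled)
```

In this toy setting the loss is near zero when each anchor is paired with its own embedding and large when the pairing is shuffled, which is the signal that pulls positives together in latent space.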

Rather than contrasting pairs, another family of methods predicts the embedding of masked regions directly in latent space. JEPA and its variants, I-JEPA [@assran2023ijepa] and V-JEPA [@bardes2024vjepa], forecast hidden-region embeddings without decoding back to pixels [@lecun2022path]. This approach encourages the encoder to grasp semantic and structural consistencies without being bound to intricate reconstruction at the pixel level. On another front, DINOv2 [@oquab2024dinov2] from the foundation-model domain generates versatile visual features through self-distillation, establishing robust state encoders for subsequent tasks. A complementary direction makes the inferred state explicitly object-centric and programmatic rather than purely continuous. Thinking with Blueprints converts an image into a JSON-style blueprint that records the positions, sizes, and attributes of question-relevant objects, and then reasons over this structured representation to answer spatial queries [@ma2026thinking]. Although proposed for VLM spatial reasoning rather than sequential control, it is highly relevant to L1 state inference because it shows that useful internal state can take the form of a decision-oriented scene description, not only a dense latent embedding.

A third line of work shapes $z_t$ through control-oriented auxiliary objectives such as reward anticipation, inverse-model losses [@pathak2017curiosity], and value-function consistency, as formalized in DeepMDP [@gelada2019deepmdp]. DeepMDP formalizes the requirement that the latent Markov chain approximately satisfy the Bellman equations. Embed to Control (E2C; @watter2015e2c) learns locally linear latent dynamics jointly with a VAE encoder-decoder, which enables LQR-based planning within the latent space.

When a single observation is insufficient, the model must aggregate past information into $z_t$. The Recurrent State Space Model (RSSM) of @hafner2019recurrent splits the latent into a deterministic recurrent pathway $h_t = f(h_{t-1}, z_{t-1}, a_{t-1})$ and a stochastic component $z_t \sim q_\phi(z_t \mid h_t, o_t)$, compressing arbitrary-length histories while preserving stochastic uncertainty. This recurrent belief state $(h_t, z_t)$ serves as the internal state for all downstream prediction and control in the Dreamer family [@hafner2019dreamer; @hafner2020dreamerv2; @hafner2023dreamerv3].
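
The two-pathway update can be sketched as follows. This is a toy illustration under strong assumptions: a fixed tanh cell stands in for the learned recurrent network, the posterior is a diagonal Gaussian with hand-picked coefficients, and the names `rssm_step`, `observations`, and `actions` are hypothetical.

```python
import math
import random

def rssm_step(h_prev, z_prev, a_prev, o_t, rng):
    """One belief update: deterministic h_t, then stochastic z_t ~ q(z_t | h_t, o_t)."""
    # Deterministic pathway h_t = f(h_{t-1}, z_{t-1}, a_{t-1});
    # a fixed tanh cell stands in for the learned GRU.
    h_t = [math.tanh(0.9 * h + 0.5 * z + 0.3 * a_prev) for h, z in zip(h_prev, z_prev)]
    # Stochastic pathway: diagonal Gaussian whose mean mixes memory and observation.
    mean = [0.5 * h + 0.5 * o for h, o in zip(h_t, o_t)]
    std = 0.1
    z_t = [m + std * rng.gauss(0.0, 1.0) for m in mean]
    return h_t, z_t

rng = random.Random(0)
h, z = [0.0, 0.0], [0.0, 0.0]
observations = [[1.0, -1.0], [0.8, -0.9], [0.0, 0.0]]  # last frame is uninformative
actions = [0.0, 1.0, 1.0]                               # a_{t-1} for each step
for o_t, a_prev in zip(observations, actions):
    h, z = rssm_step(h, z, a_prev, o_t, rng)
belief = (h, z)  # recurrent belief state (h_t, z_t)
```

Even though the final observation carries no information, the deterministic pathway still separates the two latent dimensions, which illustrates how history is carried forward in $h_t$ while $z_t$ retains sampling noise.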

Scientific applications illustrate how the same state-inference principle operates when raw observations are high-dimensional and the scientifically meaningful state is latent. In structural biology, protein structure prediction can be cast as L1 state inference: mapping an amino acid sequence (observation) to a dominant 3D coordinate state. The AlphaFold lineage progressed from learned distance-based potentials [@senior2020alphafold] to end-to-end Evoformer architectures with near-experimental accuracy [@jumper2021alphafold] to diffusion-based prediction of joint biomolecular complex structures [@abramson2024alphafold3]. Parallel efforts showed that strong structure prediction is also achievable through three-track networks [@baek2021rosettafold] and protein language models enabling single-sequence inference [@lin2023esmfold]. In neuroscience, hidden Markov models [@baker2014hmm], RNNs [@gohil2022dynemo], and Transformers [@khan2023dynemoc] have been used to map electrophysiological recordings to a set of latent network modes, following a state-inference paradigm conceptually similar to @ha2018worldmodels. Analyses of the learned, interpretable latent states indicate that cortical activity at rest can be described by transient, intermittently recurring events [@vidaurre2018spontaneous] organized into cycles on 300--1,000 ms timescales [@van2025large].

### Forward Dynamics: The Core L1 Operator

Forward-dynamics methods model $p_\theta(z_t \mid z_{t-1}, a_t)$ directly and form the *core* of L1. The dynamics network must be accurate enough to produce useful one-step forecasts and expressive enough to remain usable when composed over many steps, a requirement that becomes more stringent at L2 [@moerland2023mbrl].

In model-based reinforcement learning (RL), the action-conditioned latent transition model plays a pivotal role. PILCO [@deisenroth2011pilco] employed Gaussian-process dynamics coupled with analytical uncertainty propagation for data-efficient continuous control. MuZero [@schrittwieser2020muzero] adopts a deterministic dynamics function $z_t = f_\theta(z_{t-1}, a_t)$ trained end-to-end for value prediction and Monte Carlo Tree Search without observation reconstruction. EfficientZero [@ye2021efficientzero] extended this approach with self-supervised consistency losses, exceeding mean human performance on the Atari 100k benchmark with roughly two hours of real-time game experience. In contrast, the RSSM in Dreamer [@hafner2019dreamer; @hafner2020dreamerv2; @hafner2023dreamerv3] uses stochastic dynamics to support uncertainty-aware rollouts. PETS [@chua2018pets] showed that an *ensemble* of dynamics models offers reliable epistemic uncertainty estimates, which are crucial for robust planning. TD-MPC [@hansen2022tdmpc] learns latent dynamics through temporal-difference objectives, directly aligning the dynamics model with value estimation; TD-MPC2 [@hansen2024tdmpc2] is evaluated on 104 online RL tasks and scales to a single 317-million-parameter agent trained on 80 tasks spanning multiple domains, embodiments, and action spaces.
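
The ensemble idea can be illustrated with a deliberately small sketch. We assume scalar states, a linear least-squares model of the state change in place of neural networks, and bootstrap resampling to diversify members; these choices and all names below are illustrative assumptions rather than the configuration used by PETS.

```python
import random
import statistics

def fit_delta_model(actions, deltas):
    """Least-squares fit of the state change: delta = w * a + b."""
    ma, md = statistics.fmean(actions), statistics.fmean(deltas)
    var = sum((a - ma) ** 2 for a in actions)
    w = sum((a - ma) * (d - md) for a, d in zip(actions, deltas)) / var
    return w, md - w * ma

def ensemble_predict(models, s, a):
    """Mean next state and ensemble disagreement (std across members)."""
    preds = [s + w * a + b for w, b in models]
    return statistics.fmean(preds), statistics.stdev(preds)

rng = random.Random(0)
actions = [rng.uniform(-1.0, 1.0) for _ in range(30)]
deltas = [0.5 * a + rng.gauss(0.0, 0.1) for a in actions]  # toy ground truth

models = []
for _ in range(5):  # each member is fit on its own bootstrap resample
    idx = [rng.randrange(len(actions)) for _ in actions]
    models.append(fit_delta_model([actions[i] for i in idx],
                                  [deltas[i] for i in idx]))

_, std_in = ensemble_predict(models, s=0.0, a=0.2)    # inside the data range
_, std_out = ensemble_predict(models, s=0.0, a=10.0)  # far outside it
```

Disagreement between members is small where training data exist and grows for actions far outside the training range, which is the kind of signal a planner can use to avoid exploiting model error.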

A recent trend is the shift from continuous latent dynamics to discrete-token or diffusion-based transitions. For instance, IRIS [@micheli2023iris] tokenizes observations using a VQ-VAE codebook [@oord2017vqvae] and models the resulting sequence with an autoregressive Transformer. Meanwhile, TransDreamer [@chen2022transdreamer] swaps the GRU of RSSM with Transformer-XL to enhance long-range attention. @zhang2023storm combine Transformer sequence modeling with stochastic VAE dynamics, whereas @micheli2024deltairis encode stochastic deltas between frames instead of entire frames. On the diffusion front, DIAMOND [@alonso2024diamond] employs diffusion denoising as the one-step transition operator to preserve visual intricacies that may be overlooked by low-capacity latent dynamics.

Building on the principles of predictive coding [@rao1999predictive; @friston2010free], methods such as CPC [@oord2018cpc], SPR [@schwarzer2021spr], and JEPA [@bardes2024vjepa; @assran2023ijepa] forecast the *latent embedding* of the next observation rather than the observation itself. SimPLe [@kaiser2019simple] showed that pixel-level video prediction is viable as a world model for sample-efficient Atari RL, but the gap in prediction fidelity between pixel space and latent space is one reason for the shift toward more abstract dynamics, and it motivates the transition from L1 to L2.

Beyond using dreaming for policy optimization, one-step dynamics models also function as generators of experience: MBPO [@janner2019mbpo] integrates short model rollouts into a replay buffer to enhance sample efficiency; @nagabandi2018mbmf demonstrated that model-based pre-training combined with model-free fine-tuning leverages the strengths of both approaches. DayDreamer [@wu2023daydreamer] transfers latent imagination to physical robots, @wang2024coworld extend this to transfer reinforcement learning (RL) knowledge across visual domains, and @hao2025mosim propel world models towards acquiring long-horizon physical skills through neural motion simulation. These applications underscore that L1 dynamics serve as a data catalyst, a planning foundation [@schrittwieser2020muzero; @hafner2023dreamerv3; @hansen2024tdmpc2], and a mechanism for compressing raw interactions into high-level behaviors [@moerland2023mbrl].
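
The use of a one-step model as an experience generator can be sketched as short rollouts branched from real states. The `model` and `policy` functions and the tuple-based buffers below are hypothetical stand-ins; the sketch shows only the data flow, not the training procedure of MBPO.

```python
import random

def branched_rollouts(model, policy, real_states, horizon, rng):
    """Start from real states and roll the learned model forward `horizon` steps."""
    synthetic = []
    for s in real_states:
        for _ in range(horizon):
            a = policy(s, rng)
            s_next = model(s, a)
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic

def model(s, a):
    return 0.9 * s + a  # hypothetical learned one-step model

def policy(s, rng):
    return rng.choice([-0.1, 0.1])  # hypothetical exploratory policy

rng = random.Random(0)
real_buffer = [(0.0, 0.1, 0.1), (0.1, 0.1, 0.19), (0.19, -0.1, 0.071)]
start_states = [s for s, _, _ in real_buffer]
model_buffer = branched_rollouts(model, policy, start_states, horizon=2, rng=rng)
training_batch = real_buffer + model_buffer  # real and imagined transitions mixed
```

Keeping the horizon short limits how far imagined transitions can drift from states the model has actually seen.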

### Observation Decoding

The decoder implements $p_\psi(o_t \mid z_t)$ and has three key functions. It provides a training signal to ensure that $z_t$ preserves sufficient information, serves as a diagnostic interface to examine the model's learned representations, and acts as a rendering engine for generating envisioned observations during dreaming [@ha2018worldmodels; @hafner2019dreamer].

The Variational Autoencoder (VAE; @kingma2014vae, independently @rezende2014stochastic) offers a standard probabilistic framework: the encoder $q_\phi(z_t \mid o_t)$ maps observations to a latent posterior, while the decoder $p_\psi(o_t \mid z_t)$ reconstructs observations from latents, trained simultaneously through the ELBO. In world-modeling workflows, the VAE compresses raw pixel inputs into a concise code transmitted to a dynamics model [@ha2018worldmodels; @hafner2019recurrent]. $\beta$-VAE [@higgins2017betavae] amplifies the KL divergence term to encourage disentangled factors, while VQ-VAE [@oord2017vqvae] replaces continuous latents with a discrete codebook, foundational for token-based world models [@micheli2023iris]. The *World Models* concept by @ha2018worldmodels introduced the fusion of a VAE encoder, an LSTM dynamics model (MDN-RNN), and a distinct controller. Subsequently, the Dreamer lineage [@hafner2019dreamer; @hafner2020dreamerv2; @hafner2023dreamerv3] expanded this paradigm into an end-to-end latent imagination framework, where policy and value functions are trained solely on imagined latent trajectories, with the decoder acting mainly as a regularizer and auxiliary fidelity check rather than as a generative objective in its own right.
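
For reference, the objective alluded to above can be written in a generic single-step form. This is the standard ELBO with an optional weight $\beta$; the prior $p(z_t)$ is notation we introduce here for illustration, and the expression is not meant as the exact loss of any system cited above: $$\mathcal{L}_\beta(\phi, \psi; o_t) = \mathbb{E}_{q_\phi(z_t \mid o_t)}\big[\log p_\psi(o_t \mid z_t)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z_t \mid o_t)\,\|\,p(z_t)\big).$$ Setting $\beta = 1$ recovers the standard ELBO, and $\beta > 1$ gives the $\beta$-VAE objective. In RSSM-style models the fixed prior is typically replaced by the learned transition prior, so the KL term ties the decoder's training signal to the forward dynamics.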

Large-scale video generation models like Sora [@brooks2024sora] leverage high-capacity observation decoders to produce photorealistic frames from latent trajectories. Latent Diffusion Models [@rombach2022latentdiffusion] compress images into a lower-dimensional latent space and apply diffusion processes more efficiently in that space, while the Diffusion Transformer (DiT; @peebles2023dit) enhances scalability by replacing the U-Net backbone with a standard Transformer. Recent image-generation backbones also explore information-adaptive tokenization and generation, such as Dynamic Generative Image Transformer [@mao2026dgit], which suggests a useful design direction for observation-level decoders even though it is not itself a multi-step world model. These models illustrate the feasibility of achieving high-quality $p_\psi(o_t \mid z_t)$ at scale. However, the quality of the *latent dynamics* that steer the decoder, and the ability of $z_t$ to facilitate coherent multi-step prediction, remain challenges that motivate the transition to the L2 discussion.

### Inverse Dynamics

The inverse dynamics operator $\pi_\eta(a_t \mid z_{t-1}, z_t)$ infers the action taken between two consecutive latent states. This operator serves multiple roles in modern world-model systems. @pathak2017curiosity used an inverse dynamics objective to learn the feature space in which curiosity-driven exploration measures forward-prediction error, so that the representation captures only the controllable aspects of the environment. By training the encoder to predict actions between consecutive states, the inverse model filters out exogenous visual noise (e.g., moving clouds, flickering backgrounds) that is irrelevant to the agent's decisions. In a broader context, inverse dynamics acts as an additional training signal that encourages $z_t$ to preserve action-relevant characteristics, complementing forward dynamics and reconstruction as a mechanism for learning decision-useful representations [@lesort2018srl].

A particularly impactful application of inverse dynamics is retrospective action labeling for large-scale imitation learning. @baker2022video trained an inverse dynamics model on a small corpus of action-labeled Minecraft gameplay and then applied it to label a vastly larger set of unlabeled internet videos, enabling Video PreTraining (VPT) to learn complex behaviors such as diamond mining from passive observation alone. This pipeline demonstrates that inverse dynamics can bridge the gap between abundant unlabeled video and the action annotations required for behavioral cloning, effectively transforming observation-only data into a usable training signal for policy learning. Inverse dynamics also underpins goal-conditioned policy architectures. Given a current state $z_{t-1}$ and a desired goal state $z_g$, the inverse model predicts the action that would transition toward the goal, providing a natural interface for hierarchical planning where a high-level planner selects subgoals and a low-level inverse model executes them [@ghosh2021learning]. @agrawal2016learning demonstrated that a robot can learn intuitive physics by "poking" objects and training an inverse model to predict the poke parameters from observed state changes, illustrating how inverse dynamics can ground physical understanding through interaction.
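
The retrospective-labeling pipeline can be sketched in a few lines. We assume scalar states and a nearest-centroid inverse model over state changes; the function names and the toy data are hypothetical and far simpler than the video models used in VPT.

```python
from collections import defaultdict

def fit_idm(labeled):
    """Nearest-centroid inverse model: average state change per action."""
    sums, counts = defaultdict(float), defaultdict(int)
    for z_prev, z_next, a in labeled:
        sums[a] += z_next - z_prev
        counts[a] += 1
    return {a: sums[a] / counts[a] for a in sums}

def infer_action(idm, z_prev, z_next):
    """Pick the action whose typical state change is closest to the observed one."""
    delta = z_next - z_prev
    return min(idm, key=lambda a: abs(idm[a] - delta))

# Small action-labeled corpus: (state, next state, action).
labeled = [(0.0, 1.0, "right"), (1.0, 2.1, "right"),
           (2.0, 1.1, "left"), (1.0, -0.1, "left")]
idm = fit_idm(labeled)

# Larger observation-only trajectory, labeled retrospectively.
video = [0.0, 0.9, 2.0, 1.2, 0.1]
pseudo_labeled = [(s, infer_action(idm, s, s2), s2)
                  for s, s2 in zip(video, video[1:])]
```

The resulting `pseudo_labeled` triples have the same shape as logged transitions, so they can feed behavioral cloning directly, with the caveat that any inverse-model error becomes label noise.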

Many effective world models exclude inverse dynamics entirely, while others use it as a light regularization technique. A practical but under-discussed issue is action-label quality: when actions are inferred retrospectively rather than logged directly, inverse-model errors accumulate precisely at distribution edges where world models most need reliable supervision. Furthermore, the inverse operator assumes a unique or near-unique action between state pairs, an assumption that breaks down in stochastic environments or when multiple actions lead to the same outcome, limiting its reliability in such settings.

Discussion {#subsec:l1_theory_boundaries}
----------

Although this paper focuses on temporal local prediction indexed by time $t$, the same local-operator view can apply along non-temporal axes such as diffusion steps, refinement steps, or hierarchical update stages. We treat these as edge cases of L1 rather than as a distinct capability level, because the key property is still local transition prediction rather than decision-usable multi-step rollout.

L1 alone does not ensure coherent behavior over long horizons. Compounding one-step errors [@janner2019mbpo; @chua2018pets], loss of consistency across many steps, and the absence of mechanisms for intervention or counterfactual reasoning all motivate L2. Likewise, neither L1 nor L2 inherently adapts the model in response to new evidence; this capability is the focus of L3. The fundamental distinction between L1 and L2 lies in whether the system is formulated and assessed for *multi-step rollout accuracy and constraint adherence*, rather than solely for one-step prediction precision [@hafner2023dreamerv3; @ding2024survey_wm; @moerland2023mbrl]. The limitation of L1 is not that one-step prediction is unimportant, but that local predictive quality alone does not guarantee decision-usable behavior under composition. The practical question is thus when short-horizon operators stop meeting the planner's needs.
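
A toy calculation illustrates why small one-step errors matter under composition. We assume a scalar linear system whose true gain is 1.00 and a model whose gain is 1.02; both values are arbitrary choices for illustration.

```python
def rollout(gain, s0, horizon):
    """Compose the one-step map s -> gain * s for `horizon` steps."""
    states = [s0]
    for _ in range(horizon):
        states.append(gain * states[-1])
    return states

true_traj = rollout(1.00, s0=1.0, horizon=50)
model_traj = rollout(1.02, s0=1.0, horizon=50)  # 2% one-step error
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]
one_step_error, final_error = errors[1], errors[50]
```

Here the one-step error is 0.02, while the error after 50 composed steps is $1.02^{50} - 1$, approximately 1.69: a model that looks accurate locally is off by more than the magnitude of the state itself at the end of the rollout.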

L2 Simulator: Decision-Usable Multi-Step Simulation {#sec:l2}
===================================================

Where L1 answers *"what is the next local state given the current state and action?"*, L2 answers a decision-relevant question: *"if the agent executes a candidate action sequence under task constraints, what future trajectory is likely to unfold?"* This elevation turns one-step operators into a simulator that an agent can query before committing to action, thus providing an *imagination* of the future without requiring real-environment interaction. Model-based planning exploits exactly this capability: by rolling out candidate plans inside a learned model, the agent compares outcomes and selects the most promising course of action [@sutton1991dyna; @hafner2023dreamerv3; @schrittwieser2020muzero]. An important corollary is that any system used to generate synthetic training data for an agent implicitly serves as a world model, since it must produce state transitions realistic enough to support policy improvement [@gu2024webdreamer; @webevolver2025]. Decision-usable simulation also requires plausible dynamics, which is what separates L2 from L1, where composed state changes are not checked against the governing laws. For example, a cup passing through a solid table, a car drifting through lane boundaries without consequence, or a social commitment silently vanishing each represents a failure to preserve the governing invariants of the target regime.

Table `\ref{tab:l2_boundary_conditions}`{=latex} maps the three L2 boundary conditions to concrete instantiations in each governing-law regime. More precisely, an L2 system supports trajectory-level queries of the form $$\hat p(\tau \mid z_0, a_{1:H}, c), \quad \tau=(z_1,\ldots,z_H),$$ where $a_{1:H}$ denotes an action sequence and $c$ denotes optional constraints imposed by the governing-law regime. Intervention-structured rollouts align with the interventional rung of Pearl's causal hierarchy (Section `\ref{sec:philosophy_l2}`{=latex}). What separates L2 from L1 is not one-step predictive quality alone, but **coherent multi-step rollout under the governing laws**. L2 thus stitches per-edge L1 operators into a full trajectory $z_0\to z_1\to\cdots\to z_H$ (top block of Figure `\ref{fig:l1l2l3_graphical}`{=latex}).

```{=latex}
\begin{table*}[t]\caption{\textbf{L2 boundary conditions instantiated by governing-law regime.} Each cell specifies what the abstract condition means concretely in that domain.}
\label{tab:l2_boundary_conditions}

\footnotesize
\setlength{\tabcolsep}{0.9mm}
\begin{tabularx}{\textwidth}{@{}l *{4}{>{\arraybackslash}X}@{}}
\toprule
& \textbf{Physical World} & \textbf{Digital World} & \textbf{Social World} & \textbf{Scientific World} \\
\midrule
\textbf{Coherence}
& Object persistence and stable contacts over $H$-step manipulation sequences
& DOM/file-system consistency across multi-step UI/code interactions
& Commitment and relationship stability across multi-turn dialogue
& Causal chain validity across experimental sequences \\
\addlinespace
\hline
\addlinespace
\textbf{Sensitivity}
& Force/placement perturbation alters grasp outcome proportionally
& UI failure injection (pop-ups, timeouts) causes appropriate replan
& Changing one agent's strategy shifts negotiation outcome
& Parameter change produces directionally correct measurement shift \\
\addlinespace
\hline
\addlinespace
\textbf{Consistency}
& No interpenetration, energy conservation, kinematic feasibility
& API contract adherence, type constraints, state-machine validity
& Norm compliance, belief consistency, reflexive social dynamics
& Conservation laws, causal graph consistency, evidence-chain validity \\
\bottomrule
\end{tabularx}
\end{table*}
```
Requirements for Elevation {#subsec:l2_requirements}
--------------------------

Composing L1's local operators over multiple steps does not automatically yield a decision-usable simulator: compounding errors, action-insensitive rollouts, and violated domain invariants can each render the resulting trajectories misleading for planning. This echoes the classical frame problem [@mccarthy1969some; @shanahan1997frame]: local transition rules alone do not specify which properties should remain invariant under action, although the concern here is operational rather than logical. The interface between a planner and an L2 world model is the *query*: given an action sequence $a_{1:H}$ from state $z_0$ under constraints $c$, the model returns rollouts that the planner uses to compare candidates and select the one maximizing an objective. We treat closed-loop use (planning, acting, or control in interaction with an environment) as an orthogonal deployment property: the level boundary is determined by the depth and reliability of the world-model query, not by whether the system operates in a feedback loop. We use three boundary conditions to mark the elevation from L1 to L2:

1.  **Long-horizon coherence:** rollouts remain usable over multiple steps, rather than degrading immediately through compounding error.

2.  **Intervention sensitivity:** counterfactual edits (for example, changing actions, premises, or controllable inputs) induce stable and directionally meaningful trajectory changes.

3.  **Constraint consistency:** generated futures respect the governing-law constraints of the target regime, whether physical, digital, social, or scientific.

These are not merely conceptual distinctions; together they induce a practical test for whether a system deserves to be called an L2 simulator. A candidate system should be evaluated not only on one-step prediction quality, but also on whether performance remains decision-usable as rollout horizon increases, whether counterfactual interventions produce coherent and policy-relevant divergence, and whether generated trajectories continue to satisfy regime-specific validity constraints. A model that predicts the next step accurately yet collapses under composition, ignores action edits, or violates domain rules remains better understood as L1 with strong local prediction, rather than as a full L2 simulator.

#### From L1 to L2.

At L1, composing one-step operators yields a trajectory distribution that factorizes as $\hat p(\tau \mid z_0, a_{1:H}) = \prod_{t=1}^{H} p_\theta(z_t \mid z_{t-1}, a_t)$, with each step optimized independently; the trajectory is an unregulated byproduct. At L2, the governing-law constraint $c$ couples steps together: conceptually, $$\hat p(\tau \mid z_0, a_{1:H}, c) \propto \phi_c(\tau) \prod_{t=1}^{H} p_\theta(z_t \mid z_{t-1}, a_t),$$ where $\phi_c(\tau)$ is a governing-law compatibility term over the full rollout. The hard-indicator case $\mathbf{1}[c(\tau)]$ is a special case of $\phi_c(\tau)$ when violations are treated as strictly inadmissible. Because $\phi_c(\tau)$ depends on the entire trajectory, the L2 distribution does not factorize into independent per-step terms.
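
One way to read this expression operationally is as self-normalized importance weighting: sample rollouts from the composed one-step model, then weight each by its compatibility $\phi_c(\tau)$. The sketch below assumes a scalar state, a Gaussian one-step model, and a "stay above the floor" constraint; these choices and all function names are illustrative assumptions, not a method proposed in the text.

```python
import math
import random

def sample_rollout(z0, actions, rng, noise=0.3):
    """Compose a one-step Gaussian model z_t = z_{t-1} + a_t + eps."""
    traj, z = [], z0
    for a in actions:
        z = z + a + rng.gauss(0.0, noise)
        traj.append(z)
    return traj

def phi_soft(traj, floor=0.0, penalty=5.0):
    """Soft compatibility: decays with total penetration below `floor`."""
    violation = sum(max(0.0, floor - z) for z in traj)
    return math.exp(-penalty * violation)

def phi_hard(traj, floor=0.0):
    """Hard indicator 1[c(tau)]: any penetration is inadmissible."""
    return 1.0 if all(z >= floor for z in traj) else 0.0

rng = random.Random(0)
actions = [-0.2] * 5  # push the state toward the floor
rollouts = [sample_rollout(1.0, actions, rng) for _ in range(200)]

def normalized_weights(phi):
    """Weights proportional to phi(tau) over rollouts drawn from the L1 product."""
    w = [phi(t) for t in rollouts]
    total = sum(w)
    return [x / total for x in w]

w_soft = normalized_weights(phi_soft)
w_hard = normalized_weights(phi_hard)
```

With the hard indicator this reduces to rejecting every rollout that crosses the floor, whereas the soft term keeps such rollouts with exponentially smaller weight; in both cases the weight depends on the whole trajectory rather than on any single step.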

Each requirement maps to a diagnostic signal and a mitigation strategy. *Long-horizon coherence* is diagnosed by a success cliff at a specific horizon $H$; the primary mitigation is task segmentation with frequent replanning. *Intervention sensitivity* is diagnosed by action insensitivity across rollouts (changing $a_t$ produces no meaningful trajectory change); mitigation requires explicit action-consistency evaluation. *Constraint consistency* is measured by the constraint-violation rate; mitigations include hard constraint layers and verification gates. A fourth property, *calibration*, requires that confidence track actual accuracy under distribution shift; overconfident wrong predictions signal failure, and distribution-shift detection is the main remedy.
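
The three diagnostic signals named above (a success cliff, action insensitivity, and the constraint-violation rate) can each be computed with a few lines of code. The sketch below is a minimal illustration; the threshold `drop`, the toy models, and all function names are our own assumptions rather than an established evaluation protocol.

```python
def success_cliff(success_by_horizon, drop=0.3):
    """First horizon at which success falls by more than `drop` from the previous one."""
    horizons = sorted(success_by_horizon)
    for prev, cur in zip(horizons, horizons[1:]):
        if success_by_horizon[prev] - success_by_horizon[cur] > drop:
            return cur
    return None

def rollout(model, z0, actions):
    traj, z = [], z0
    for a in actions:
        z = model(z, a)
        traj.append(z)
    return traj

def action_sensitivity(model, z0, actions, edited):
    """Mean per-step gap between rollouts under original and edited actions."""
    base, alt = rollout(model, z0, actions), rollout(model, z0, edited)
    return sum(abs(x - y) for x, y in zip(base, alt)) / len(base)

def violation_rate(rollouts, is_valid):
    """Fraction of rollouts containing at least one invalid state."""
    bad = sum(1 for traj in rollouts if not all(is_valid(z) for z in traj))
    return bad / len(rollouts)

def responsive(z, a):
    return z + a      # toy model that reacts to actions

def inert(z, a):
    return 0.9 * z    # toy model that ignores the action entirely

cliff = success_cliff({1: 0.95, 5: 0.9, 10: 0.85, 20: 0.3})
sens_ok = action_sensitivity(responsive, 0.0, [1, 1, 1], [1, -1, 1])
sens_bad = action_sensitivity(inert, 1.0, [1, 1, 1], [1, -1, 1])
rate = violation_rate([[0.5, 0.2], [0.4, -0.1]], is_valid=lambda z: z >= 0.0)
```

In this toy setup the cliff is detected at horizon 20, the inert model shows zero sensitivity to the action edit, and half of the rollouts violate the non-negativity constraint.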

#### Residual frame-problem manifestations.

Modern neural world models sidestep the classical frame problem's representational burden by learning implicitly from data what persists and what changes [@goodfellow2016deep; @hafner2023dreamerv3], enabling scalable model-based RL [@hafner2019dreamer; @schrittwieser2020muzero; @moerland2023mbrl] and video prediction [@babaeizadeh2018sv2p; @brooks2024sora] without explicit frame axioms. Yet the problem resurfaces at rollout time: context-window limits and hallucination cause models to lose track of relevant past information, violating long-horizon coherence, while rare preconditions under-represented in training data undermine constraint consistency [@ding2024survey_wm; @shanahan1997frame]. These failure modes motivate the techniques surveyed below (see Appendix `\ref{app:boundaries}`{=latex}).

Applications {#subsec:l2_app}
------------

In this section, we categorize L2 systems into four governing-law regimes. Tables `\ref{tab:l2_systems_phys_dig}`{=latex} and `\ref{tab:l2_systems_soc_sci}`{=latex} provide anchor systems for cross-regime comparison and summarize how each domain instantiates the boundary conditions.

```{=latex}
\begin{table*}[ht]\caption{\textbf{Representative L2 anchor systems:} Physical and Digital Worlds.
Columns indicate long-horizon coherence (\textbf{LH}), intervention sensitivity (\textbf{IS}),
and constraint consistency (\textbf{CC}). The table is a compact comparison set, not an exhaustive inventory of all systems discussed in the prose.}
\label{tab:l2_systems_phys_dig}

\small
\setlength{\tabcolsep}{0.9mm}
\begin{tabular}{l|cc|cccl}
\toprule
\textbf{Method} & \multicolumn{2}{c|}{\textbf{Links}} & \textbf{LH} & \textbf{IS} & \textbf{CC} & \textbf{Architecture} \\
\midrule
\rowcolor{gray!15}
\textit{Physical World} \\
MuZero~\citep{schrittwieser2020muzero} & \paperlink{https://arxiv.org/abs/1911.08265} & {---} & \cmark & \cmark & \cmark & MLP dynamics + MCTS \\
Plan2Explore~\citep{sekar2020plan2explore} & \paperlink{https://arxiv.org/abs/2005.05960} & \githublink{https://github.com/ramanans1/plan2explore} & \cmark & \cmark & \xmark & Dreamer + self-supervised exploration \\
PathDreamer~\citep{koh2021pathdreamer} & \paperlink{https://arxiv.org/abs/2105.08756} & \githublink{https://github.com/google-research/pathdreamer} & \cmark & \xmark & \xmark & Autoregressive visual VLN \\
DreamerPro~\citep{deng2022dreamerpro} & \paperlink{https://arxiv.org/abs/2110.14565} & \githublink{https://github.com/fdeng18/dreamer-pro} & \cmark & \cmark & \xmark & RSSM + prototypical representations \\
DreamingV2~\citep{okada2022dreamingv2} & \paperlink{https://arxiv.org/abs/2203.00494} & {---} & \cmark & \cmark & \xmark & DreamerV2 + reconstruction-free \\
Diffuser~\citep{janner2022diffuser} & \paperlink{https://arxiv.org/abs/2205.09991} & \githublink{https://github.com/jannerm/diffuser} & \cmark & \xmark & \xmark & Diffusion trajectory planning \\
DreamerV3~\citep{hafner2023dreamerv3} & \paperlink{https://arxiv.org/abs/2301.04104} & \githublink{https://github.com/danijar/dreamerv3} & \cmark & \cmark & \cmark & RSSM + symlog loss \\
DayDreamer~\citep{wu2023daydreamer} & \paperlink{https://arxiv.org/abs/2206.14176} & \githublink{https://github.com/danijar/daydreamer} & \cmark & \cmark & \cmark & RSSM real-world robots \\
GAIA-1~\citep{hu2023gaia1} & \paperlink{https://arxiv.org/abs/2309.17080} & {---} & \cmark & \cmark & \xmark & Transformer video generation \\
DIAMOND~\citep{alonso2024diamond} & \paperlink{https://arxiv.org/abs/2405.12399} & \githublink{https://github.com/eloialonso/diamond} & \cmark & \cmark & \cmark & U-Net diffusion \\
Sora~\citep{brooks2024sora} & \paperlink{https://openai.com/index/video-generation-models-as-world-simulators/} & {---} & \cmark & \xmark & \xmark & DiT video diffusion \\
Genie~\citep{bruce2024genie} & \paperlink{https://arxiv.org/abs/2402.15391} & {---} & \cmark & \cmark & \xmark & ST-transformer + VQ actions \\
iVideoGPT~\citep{wu2024ivideogpt} & \paperlink{https://arxiv.org/abs/2405.15223} & \githublink{https://github.com/thuml/iVideoGPT} & \cmark & \cmark & \xmark & Transformer + VQ-VAE \\
OccWorld~\citep{zheng2024occworld} & \paperlink{https://arxiv.org/abs/2311.16038} & \githublink{https://github.com/wzzheng/OccWorld} & \cmark & \cmark & \cmark & GPT 3D occupancy prediction \\
Vista~\citep{gao2024vista} & \paperlink{https://arxiv.org/abs/2405.17398} & \githublink{https://github.com/OpenDriveLab/Vista} & \cmark & \cmark & \xmark & Diffusion driving generation \\
DriveDreamer~\citep{wang2024drivedreamer} & \paperlink{https://arxiv.org/abs/2309.09777} & \githublink{https://github.com/JeffWang987/DriveDreamer} & \cmark & \cmark & \xmark & Diffusion AD generation \\
Copilot4D~\citep{zhang2024copilot4d} & \paperlink{https://arxiv.org/abs/2311.01017} & {---} & \cmark & \cmark & \cmark & VQ-VAE + point diffusion \\
LWM~\citep{liu2024lwm} & \paperlink{https://arxiv.org/abs/2402.08268} & \githublink{https://github.com/LargeWorldModel/LWM} & \cmark & \xmark & \xmark & RingAttention long-context LLM \\
DreMa~\citep{wu2024drema} & \paperlink{https://arxiv.org/abs/2412.14957} & \githublink{https://github.com/leobarcellona/drema_code} & \cmark & \cmark & \xmark & Compositional 3DGS twins \\
Cosmos~\citep{nvidia2025cosmos} & \paperlink{https://arxiv.org/abs/2501.03575} & \githublink{https://github.com/NVIDIA/Cosmos} & \cmark & \cmark & \xmark & Autoregressive + diffusion hybrid \\
Aether~\citep{zhu2025aether} & \paperlink{https://arxiv.org/abs/2503.18945} & \githublink{https://github.com/OpenRobotLab/Aether} & \cmark & \cmark & \cmark & CogVideoX geometry fine-tune \\
PIN-WM~\citep{li2025pinwm} & \paperlink{https://arxiv.org/abs/2504.16693} & \githublink{https://github.com/XuAdventurer/PIN-WM} & \cmark & \cmark & \cmark & Differentiable rigid-body + 3DGS \\
Yume~\citep{mao2025yume} & \paperlink{https://arxiv.org/abs/2507.17744} & \githublink{https://github.com/stdstu12/YUME} & \cmark & \cmark & \xmark & Video diffusion world generation \\
GAIA-2~\citep{nvidia2025gaia2} & \paperlink{https://arxiv.org/abs/2503.20523} & {---} & \cmark & \cmark & \xmark & Latent diffusion multi-view AD \\
RoboScape~\citep{chen2025roboscape} & \paperlink{https://arxiv.org/abs/2506.23135} & \githublink{https://github.com/tsinghua-fib-lab/RoboScape} & \cmark & \cmark & \xmark & Physics-informed robot video \\
BridgeV2W~\citep{wang2026bridgev2w} & \paperlink{https://arxiv.org/abs/2602.03793} & {---} & \cmark & \cmark & \xmark & Action-conditioned embodied video \\
HWM~\citep{zhang2026hwm} & \paperlink{https://arxiv.org/abs/2604.03208} & \githublink{https://github.com/kevinghst/HWM_PLDM} & \cmark & \cmark & \xmark & Hierarchical latent + MCTS \\
\midrule
\rowcolor{gray!15}
\textit{Digital World} \\
GameGAN~\citep{kim2020gamegan} & \paperlink{https://arxiv.org/abs/2005.12126} & \githublink{https://github.com/nv-tlabs/GameGAN_code} & \cmark & \cmark & \xmark & GAN neural game engine \\
WebDreamer~\citep{gu2024webdreamer} & \paperlink{https://arxiv.org/abs/2411.06559} & \githublink{https://github.com/OSU-NLP-Group/WebDreamer} & \cmark & \cmark & \xmark & LLM web state simulation \\
CodeWM~\citep{dainese2024codewm} & \paperlink{https://arxiv.org/abs/2405.15383} & \githublink{https://github.com/nicoladainese96/code-world-models} & \cmark & \cmark & \cmark & LLM + MCTS code generation \\
WorldCoder~\citep{tang2024worldcoder} & \paperlink{https://arxiv.org/abs/2402.12275} & \githublink{https://github.com/ma-labo/worldcoder} & \cmark & \cmark & \cmark & LLM incremental code synthesis \\
GameNGen~\citep{valevski2025gamengin} & \paperlink{https://arxiv.org/abs/2408.14837} & {---} & \cmark & \cmark & \xmark & U-Net diffusion \\
WMA~\citep{chae2025wma} & \paperlink{https://arxiv.org/abs/2410.13232} & \githublink{https://github.com/kyle8581/WMA-Agents} & \cmark & \cmark & \xmark & LLM web transition prediction \\
WebSynthesis~\citep{gao2025websynthesis} & \paperlink{https://arxiv.org/abs/2507.04370} & \githublink{https://github.com/LucusFigoGao/WebSynthesis} & \cmark & \cmark & \xmark & LLM + MCTS planning \\
NeuralOS~\citep{rivard2025neuralos} & \paperlink{https://arxiv.org/abs/2507.08800} & \githublink{https://github.com/yuntian-group/neural-os} & \cmark & \cmark & \xmark & RNN + pixel diffusion \\
GameFactory~\citep{yu2025gamefactory} & \paperlink{https://arxiv.org/abs/2501.08325} & \githublink{https://github.com/KwaiVGI/GameFactory} & \cmark & \cmark & \xmark & Action-controlled video generation \\
GameCraft~\citep{li2025gamecraft} & \paperlink{https://arxiv.org/abs/2506.17201} & \githublink{https://github.com/Tencent-Hunyuan/Hunyuan-GameCraft-1.0} & \cmark & \cmark & \xmark & Diffusion game video generation \\
MobileDreamer~\citep{cao2026mobiledreamer} & \paperlink{https://arxiv.org/abs/2601.04035} & {---} & \cmark & \cmark & \xmark & LLM GUI sketch prediction \\
Word2World~\citep{li2025wordtoworld} & \paperlink{https://arxiv.org/abs/2512.18832} & \githublink{https://github.com/X1AOX1A/Word2World} & \cmark & \cmark & \xmark & LLM text-based WM \\
Code2World~\citep{zheng2026code2world} & \paperlink{https://arxiv.org/abs/2602.09856} & \githublink{https://github.com/AMAP-ML/Code2World} & \cmark & \cmark & \cmark & VLM code rendering \\
gWorld~\citep{gworld2026} & \paperlink{https://arxiv.org/abs/2602.01576} & \githublink{https://github.com/trillion-labs/gWorld} & \cmark & \cmark & \cmark & VLM code rendering \\
WebWorld~\citep{xiao2026webworld} & \paperlink{https://arxiv.org/abs/2602.14721} & {---} & \cmark & \cmark & \xmark & Fine-tuned VLM web simulator \\
RWML~\citep{yu2026rwml} & \paperlink{https://arxiv.org/abs/2602.05842} & {---} & \cmark & \cmark & \xmark & LLM + RL sim-to-real \\
% GameWorld~\citep{ouyang2026gameworld} & \paperlink{https://arxiv.org/pdf/2604.07429} & \githublink{https://github.com/gameworld-project/gameworld} & \cmark & \cmark & \cmark & Browse game engine \\
\bottomrule
\end{tabular}%
\end{table*}
```
```{=latex}
\begin{table*}[ht]\caption{\textbf{Representative L2 anchor systems (continued):} Social and Scientific Worlds.
Columns indicate long-horizon coherence (\textbf{LH}), intervention sensitivity (\textbf{IS}),
and constraint consistency (\textbf{CC}).
Links are provided where available; the table is a compact comparison set, not an exhaustive inventory of all systems discussed in the prose.
}
\label{tab:l2_systems_soc_sci}

\small
\setlength{\tabcolsep}{0.9mm}
\begin{tabular}{l|cc|cccl}
\toprule
\textbf{Method} & \multicolumn{2}{c|}{\textbf{Links}} & \textbf{LH} & \textbf{IS} & \textbf{CC} & \textbf{Architecture} \\
\midrule
\rowcolor{gray!15}
\textit{Social World} \\
Deal or No Deal~\citep{lewis2017dealornodeal} & \paperlink{https://arxiv.org/abs/1706.05125} & \githublink{https://github.com/facebookresearch/end-to-end-negotiator} & \cmark & \cmark & \cmark & RNN + RL self-play \\
Social Simulacra~\citep{park2022socialsimulacra} & \paperlink{https://dl.acm.org/doi/10.1145/3526113.3545616} & {---} & \cmark & \cmark & \xmark & GPT prompt-chain community simulation \\
CICERO~\citep{meta2022cicero} & \paperlink{https://doi.org/10.1126/science.ade9097} & \githublink{https://github.com/facebookresearch/diplomacy_cicero} & \cmark & \cmark & \cmark & LLM + strategic planning \\
Generative Agents~\citep{park2023generative} & \paperlink{https://arxiv.org/abs/2304.03442} & \githublink{https://github.com/joonspk-research/generative_agents} & \cmark & \cmark & \cmark & LLM reflective memory \\
Sotopia~\citep{zhou2024sotopia} & \paperlink{https://arxiv.org/abs/2310.11667} & \githublink{https://github.com/sotopia-lab/sotopia} & \cmark & \cmark & \cmark & LLM social evaluation \\
AvalonBench~\citep{light2023avalonbench} & \paperlink{https://arxiv.org/abs/2310.05036} & \githublink{https://github.com/jonathanmli/Avalon-LLM} & \cmark & \cmark & \cmark & LLM deductive reasoning \\
Werewolf~\citep{xu2024werewolf} & \paperlink{https://arxiv.org/abs/2310.18940} & \githublink{https://github.com/xuyuzhuang11/Werewolf} & \cmark & \cmark & \cmark & LLM + RL strategic policy \\
ProjectSid~\citep{altera2024projectsid} & \paperlink{https://arxiv.org/abs/2411.00114} & \githublink{https://github.com/altera-al/project-sid} & \cmark & \cmark & \cmark & LLM multi-agent civilization simulation \\
OASIS~\citep{yang2024oasis} & \paperlink{https://arxiv.org/abs/2411.11581} & \githublink{https://github.com/camel-ai/oasis} & \cmark & \cmark & \cmark & LLM social simulation \\
MASim~\citep{zhang2025masim} & \paperlink{https://arxiv.org/abs/2512.07195} & {---} & \cmark & \cmark & \xmark & Multilingual agent simulation \\
SWM-AP~\citep{zhang2025swmap} & \paperlink{https://arxiv.org/abs/2510.19270} & {---} & \cmark & \cmark & \xmark & Social WM mechanism design \\
AIvilization~\citep{fan2026aivilization} & \paperlink{https://arxiv.org/abs/2602.10429} & {---} & \cmark & \cmark & \cmark & Sandbox economy simulation \\
PolicySim~\citep{huang2026policysim} & \paperlink{https://arxiv.org/abs/2603.19649} & \githublink{https://github.com/renH2/PolicySim} & \cmark & \cmark & \xmark & LLM platform policy sandbox \\
\midrule
\rowcolor{gray!15}
\textit{Scientific World} \\
GNS~\citep{sanchez2020gns} & \paperlink{https://arxiv.org/abs/2002.09405} & \githublink{https://github.com/deepmind/deepmind-research} & \cmark & \xmark & \cmark & GNN message passing \\
ChemBO~\citep{pmlr-v108-korovina20a} & \paperlink{https://proceedings.mlr.press/v108/korovina20a.html} & \githublink{https://github.com/kamikaze0923/ChemBo} & \cmark & \xmark & \cmark & GP + synthesis-graph BO \\
P3BO~\citep{angermueller2020population} & \paperlink{https://proceedings.mlr.press/v119/angermueller20a.html} & {---} & \cmark & \cmark & \xmark & Adaptive population-based optimization \\
FNO~\citep{li2021fno} & \paperlink{https://arxiv.org/abs/2010.08895} & \githublink{https://github.com/neuraloperator/neuraloperator} & \cmark & \xmark & \cmark & Fourier neural operator \\
Pangu-Weather~\citep{bi2023panguweather} & \paperlink{https://arxiv.org/abs/2211.02556} & \githublink{https://github.com/198808xc/Pangu-Weather} & \cmark & \xmark & \cmark & 3D Earth transformer \\
ClimaX~\citep{nguyen2023climax} & \paperlink{https://arxiv.org/abs/2301.10343} & \githublink{https://github.com/microsoft/ClimaX} & \cmark & \xmark & \cmark & ViT climate foundation \\
GraphCast~\citep{lam2023graphcast} & \paperlink{https://arxiv.org/abs/2212.12794} & \githublink{https://github.com/google-deepmind/graphcast} & \cmark & \xmark & \cmark & GNN autoregressive weather \\
GenCast~\citep{price2024gencast} & \paperlink{https://arxiv.org/abs/2312.15796} & \githublink{https://github.com/google-deepmind/graphcast} & \cmark & \xmark & \cmark & Spherical ensemble diffusion \\
NeuralGCM~\citep{kochkov2024neuralgcm} & \paperlink{https://arxiv.org/abs/2311.07222} & \githublink{https://github.com/google-research/neuralgcm} & \cmark & \xmark & \cmark & Hybrid physics--NN core \\
BAX~\citep{chitturi2024targeted} & \paperlink{https://www.nature.com/articles/s41524-024-01326-2} & \githublink{https://github.com/sathya-chitturi/multibax-sklearn} & \cmark & \cmark & \cmark & GP + user-directed acquisition \\
Aurora~\citep{bodnar2025aurora} & \paperlink{https://arxiv.org/abs/2405.13063} & \githublink{https://github.com/microsoft/aurora} & \cmark & \xmark & \cmark & 3D Swin weather foundation \\
Lingshu-Cell~\citep{zhang2026lingshucell} & \paperlink{https://arxiv.org/abs/2603.25240} & {---} & \cmark & \xmark & \cmark & Masked diffusion cellular WM \\
\bottomrule
\end{tabular}%
\end{table*}
```
### Laws of the Physical World {#subsec:l2_physical}

In the physical domain, L2 models should respect geometry, kinematics, and conservation laws. The governing constraints are contact, reachability, stability, and energy conservation; violations of any of these will mislead a planner into proposing actions that fail catastrophically in real execution.

#### Physics simulation.

*Rigid-body control simulators.* Classical physics simulators remain the foundation layer for executable transition validity in embodied world modeling. MuJoCo provides articulated rigid-body dynamics and contact-rich control, with dm\_control packaging these capabilities into a standardized continuous-control suite [@todorov2012mujoco; @tassa2020dmcontrol]. Brax pushes differentiable rigid-body simulation toward accelerator-scale throughput [@freeman2021brax], while Isaac Gym and Isaac Lab emphasize massive GPU-parallel robotics simulation [@makoviychuk2021isaacgym; @nvidia2025isaaclab].

*Scalable and general-purpose simulation platforms.* Genesis positions itself as a generative and universal physics engine [@genesis2024], reflecting the broader trend toward higher-throughput simulators that can jointly support both control and large-scale synthetic-data generation.

*Interaction-centric embodied simulators.* At the graphics-and-robotics interface, SAPIEN provides part-aware, interaction-centric simulation, and ManiSkill3 scales GPU-parallel rendering for generalizable embodied AI [@xiang2020sapien; @tao2024maniskill3]. These systems are not learned simulators; they are explicit law executors whose value lies in precise contact handling, articulated constraints, and reproducible rollouts.

#### Video generation models.

*Appearance-first long-horizon video generation.* A scalable route to physical-world simulation is the *video interface*: given current observations and optional actions, the model returns imagined future frames. This line begins with appearance-first rollout, where systems such as Sora, Lumiere, and VideoPoet demonstrate coherent visual dynamics over extended horizons [@brooks2024sora; @bartal2024lumiere; @kondratyuk2024videopoet], with geometry-aware structure increasingly emerging beyond pixel-level realism [@li2024sora_geometry]. FramePack [@zhang2025framepack] and Self-Forcing [@huang2025selfforcing] reduce long-horizon drift, the former through frame-context packing and the latter by training on the model's own autoregressive rollouts.

*Action-conditioned and interactive video worlds.* A second direction moves from passive continuation toward intervention-aware generation. Genie learns latent action spaces from unlabeled Internet video [@bruce2024genie], while GAIA-1 conditions future generation on explicit control signals for counterfactual evaluation [@hu2023gaia1]. More recent systems push this line toward real-time, long-horizon, and streaming interaction: Oasis explores open-ended interactive generation in a unified transformer world [@oasis2024]; WorldPlay emphasizes long-term geometric consistency for real-time interactive world modeling [@sun2025worldplay]; Matrix-Game 3.0 extends interactive generation to streaming settings with explicit long-horizon memory [@matrixgame3]; Yume-1.5 studies text-controlled interactive world generation [@mao2025yume]; and LongLive targets real-time interactive long video generation [@yang2025longlive]. Taken together, these systems mark a shift from passive video prediction toward controllable, intervention-aware, and temporally persistent video worlds.

*Decision-oriented video world models.* In model-based RL, SimPLe [@kaiser2019simple] and DIAMOND [@alonso2024diamond] make the decision-theoretic role of video world models explicit. In robotics, DreamZero [@ye2026dreamzero] and DreamDojo [@gao2026dreamdojo] demonstrate zero-shot and generalist policy learning via video world models, while FutureVLA [@xu2026futurevla] couples visuomotor prediction directly with Vision-Language-Action policies to unify perception and control.

*Evaluation and limitations.* Within our L2 framing, however, visual plausibility does not equal decision-usability. Intervention sensitivity remains fragile, long-horizon coherence is easily overstated when judged by perceptual quality alone [@guo2025logic], and constraint consistency is difficult to verify from rendered frames. Standard metrics such as FVD [@unterthiner2018fvd] capture distributional realism; VBench-style suites [@huang2023vbench; @huang2025vbench++] better decompose controllability; VBench-2.0 [@zheng2025vbench2] extends evaluation to physics consistency and commonsense reasoning; and VChain [@huang2025vchain] introduces visual chain-of-thought for causal coherence. Video interfaces are the most scalable observation-layer entry point, but planner-critical structure remains implicit in pixels; Appendix `\ref{app:l2_extended}`{=latex} surveys geometry-carrying alternatives that make such structure explicit.

#### Robotics and sim-to-real transfer.

*World models transferred to real robots.* DayDreamer [@wu2023daydreamer] showed that Dreamer-family world models can be trained directly on physical robots, without a simulator, while handling sensor noise, contact dynamics, and actuation delays. DreamZero [@ye2026dreamzero] achieves zero-shot policy learning via world action models that predict both next states and actions, and FutureVLA [@xu2026futurevla] embeds visuomotor prediction within Vision-Language-Action models to improve action grounding.

*Physics-grounded bridges for sim-to-real robustness.* PIN-WM [@li2025pinwm] integrates differentiable physics with learned visual world modeling, creating "digital cousins" via physics-aware randomization.

*Representation requirement.* Across these systems, the key question is not whether richer representations are possible, but what is the weakest representation that still preserves planner-critical structure, such as object persistence, free space, contact onset, support relations, and action-conditioned change over useful horizons. Extended details on 3D-structured world models and autonomous driving appear in Appendix `\ref{app:l2_extended}`{=latex}.

### Laws of the Digital World {#subsec:l2_software}

The Laws of the Digital World govern transitions in systems defined by formal specifications, from finite automata (UI state machines) and context-free grammars (structured data formats) to Turing-complete programs (general software). Unlike the Laws of the Physical World or the Laws of the Social World, these constraints are *explicitly specified and mechanically verifiable*: a transition either satisfies the program's semantics or it does not. Because software transitions approximate deterministic state machines and failures are loggable (error codes, popups, permission denials, timeouts), the core challenge for a Simulator in code worlds is *structured state prediction* (DOM trees, program state, game state) rather than visual fidelity.

#### Coding agents.

An emerging paradigm represents world models as executable programs rather than neural networks. CodeWM [@dainese2024codewm] uses LLMs guided by Monte Carlo Tree Search to generate Python programs that serve as explicit, interpretable world models for reinforcement learning across 18 environments. WorldCoder [@tang2024worldcoder] takes a complementary approach, with an LLM agent building a Python world model incrementally through environment interaction for sample-efficient transfer. WKM [@qiao2024wkm] provides both global task knowledge and dynamic state knowledge to guide LLM agent planning, while CWM [@copet2025cwm], a 32B open-weights LLM specifically trained for code world model research, achieves 65.8% on SWE-bench Verified. A conceptually distinct variant pushes further: rather than using an LLM to *generate* a code world model, the world model *is* a running software system. Web World Models [@feng2025wwm] implement world state as ordinary web code (TypeScript modules, HTTP handlers, database schemas), delegating logical consistency to deterministic execution of the web stack while LLMs generate context and high-level decisions. These code-based approaches yield interpretable, composable, and verifiable world models that neural dynamics can only approximate.
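
To make the idea of a world model as an executable program concrete, the sketch below hand-writes a transition function for a toy corridor environment and scores it against logged transitions. It is an illustration only: the environment, the `State` fields, and the `consistency` score are hypothetical, and it does not reproduce how CodeWM or WorldCoder generate or refine their programs.

```python
from typing import Callable, List, NamedTuple, Tuple


class State(NamedTuple):
    x: int           # position in a corridor with cells 0..5
    has_key: bool    # key lies at x == 2
    door_open: bool  # door blocks the step from x == 4 to x == 5


def transition(s: State, action: str) -> State:
    """Hypothetical code world model: an explicit, inspectable transition rule."""
    if action == "right":
        if s.x == 4 and not s.door_open:
            return s  # a closed door blocks movement
        return s._replace(x=min(s.x + 1, 5))
    if action == "left":
        return s._replace(x=max(s.x - 1, 0))
    if action == "pickup" and s.x == 2:
        return s._replace(has_key=True)
    if action == "open" and s.x == 4 and s.has_key:
        return s._replace(door_open=True)
    return s  # invalid or inapplicable actions leave the state unchanged


def rollout(model: Callable[[State, str], State], s: State, actions: List[str]) -> State:
    """Compose one-step transitions into a multi-step simulation."""
    for a in actions:
        s = model(s, a)
    return s


def consistency(model: Callable[[State, str], State],
                logged: List[Tuple[State, str, State]]) -> float:
    """Fraction of logged (state, action, next_state) triples the program reproduces."""
    return sum(model(s, a) == s_next for s, a, s_next in logged) / len(logged)
```

Because the model is ordinary code, a mismatch reported by `consistency` points to a specific rule that can be read and edited, which is the interpretability and verifiability property the paragraph above refers to.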

#### Web agents.

Web agents act by navigating websites, so modeling and simulating state transitions within a website is crucial for building effective web world models. WebDreamer [@gu2024webdreamer] introduced the idea of using an LLM as an implicit world model of the internet, but subsequent work showed that off-the-shelf LLMs are insufficient: dedicated training with transition-focused abstraction is needed [@chae2025wma]. A growing body of work addresses the co-evolution of agent and world model. WebEvolver [@webevolver2025] tightly couples the two in a mutual improvement loop, while DreamGym [@chen2025dreamgym] builds experience models with chain-of-thought reasoning, achieving over 30% improvement on WebArena. At larger scale, WebSynthesis [@gao2025websynthesis] combines world models with MCTS-based planning using entirely synthetic data, and WebWorld [@xiao2026webworld] trains an open-web simulator on over one million trajectories supporting 30+ step simulation. AUI [@lin2025computer] takes a different approach, employing a Coder to optimize websites by leveraging feedback from a Computer-Use Agent in an iterative collaboration loop. Orthogonal design choices include generating trajectories from tool specifications alone (Simia; @li2025simia), adding a metacognitive layer that decides *whether* to consult the world model at each step (WAC; @wac2026), and using agent-collected data to handle out-of-distribution behaviors.

#### GUI agents.

GUI agents [@qin2025ui; @lin2025showui; @xu2024aguvis] typically execute actions in real environments. When an action may be dangerous or lead to an undesired outcome, it is beneficial to estimate its effect beforehand: a GUI world model can simulate the action and evaluate the predicted result before anything is executed. MobileDreamer [@cao2026mobiledreamer] transforms GUI images into task-related sketches for structured state prediction, while MobileWorldBench [@li2025mobileworldbench] provides systematic evaluation with 1.4 million (state, action, future state) triplets. Complementary to explicit GUI world models, UI-AGILE shows that effective reinforcement learning and precise inference-time grounding remain equally important for strong downstream GUI-agent performance [@lian2025ui]. A central design question is the output representation: ViMo [@luo2025vimo] generates future observations as images using symbolic text representation, while gWorld [@gworld2026] generates renderable web code as the predicted next state, suggesting that generating the code that renders the GUI can be more faithful than generating pixels directly. At the OS level, NeuralOS [@rivard2025neuralos] simulates desktop GUIs by predicting screen frames from user inputs, while CUWM [@cuwm2026] targets desktop software where persistent document state must be preserved across long-horizon workflows. Code2World [@zheng2026code2world] further extends this line by treating code as a renderable world, where generated programs directly produce visual states (e.g., HTML) upon execution. This enables modeling environment dynamics as executable code generation, tightly coupling perception, action, and state transition in interactive domains such as GUIs.

### Laws of the Social World {#subsec:l2_social}

Societal world models extend L2 to human interaction, where governing laws are beliefs, desires, intentions, norms, and institutions rather than physics. Social worlds exhibit three distinctive properties: *opacity* (agents cannot directly observe each other's mental states), *reflexivity* (beliefs about social state create feedback loops), and *normativity* (transitions are governed partly by shared norms). These traits make the transition function partially constituted by collective agreement rather than natural law [@zheng2023mcu]. A usable social simulator separates surface language from underlying social state: dialogue can vary, but core states (goals, beliefs, relations, norms) must remain consistent and yield interpretable transitions, as formalized by the Rational Speech Acts framework [@goodman2016rsa; @degen2023rsa]. Concretely, a social compatibility term $\phi_c(\tau)$ can encode commitment consistency: if agent $i$ promises action $b$ at time $t$, later states receive low compatibility when $i$ violates $b$ without explanation, renegotiation, or sanction. Similar terms can score norm compliance, role consistency, or belief-state coherence over the trajectory.
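
One way to instantiate the commitment-consistency term is sketched below. The functional form is an assumption made for illustration: the text fixes only the qualitative behavior of $\phi_c(\tau)$, so the exponential penalty, the `Event` schema, and the rule that a promise counts as violated when it is neither fulfilled nor released by the end of the trajectory are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    t: int
    agent: str
    kind: str    # "promise", "do", or "release" (explanation, renegotiation, or sanction)
    action: str


def commitment_compatibility(trajectory: List[Event], penalty: float = 1.0) -> float:
    """Assumed form: phi_c(tau) = exp(-penalty * number of unresolved commitments).
    A commitment (agent, action) is resolved when the agent later performs the
    action or the commitment is explicitly released."""
    open_commitments = set()
    for e in sorted(trajectory, key=lambda ev: ev.t):
        key = (e.agent, e.action)
        if e.kind == "promise":
            open_commitments.add(key)
        elif e.kind in ("do", "release"):
            open_commitments.discard(key)
    return math.exp(-penalty * len(open_commitments))


kept = [Event(0, "i", "promise", "b"), Event(3, "i", "do", "b")]
broken = [Event(0, "i", "promise", "b"), Event(3, "i", "do", "c")]
renegotiated = [Event(0, "i", "promise", "b"), Event(2, "i", "release", "b")]
```

Under these assumptions, `kept` and `renegotiated` both score 1.0, while `broken` is penalized; norm compliance or role consistency could be scored with analogous terms over the same event log.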

#### Theory of mind as social state.

The computational foundation was laid by Bayesian ToM (BToM), which formalizes mental state inference as probabilistic inverse planning over rational agents [@baker2011bayesian]. Neural approaches began with ToMnet [@rabinowitz2018tomnet], whose character, mental state, and prediction networks jointly infer traits and beliefs, and recent work such as LaBToM [@ying2025labtom] bridges Bayesian inverse planning with formal epistemic language. However, current models lack robust mental state reasoning: FANToM [@kim2023fantom] reveals "illusory ToM" across all state-of-the-art LLMs, and on ExploreToM [@sclar2024exploretom] GPT-4o reaches accuracy as low as 9% [@chen2025tomsurvey]. A complementary challenge is the *dual-structure* problem: a social agent must simultaneously model others' mental states (theory of mind) and maintain its own persistent internal state (goals, persona, memory, and knowledge) across long interactions. Cognitive Architectures for Language Agents (CoALA) [@sumers2024coala] formalizes this dual structure as separate memory and action spaces that must remain mutually consistent, and provides a principled framework for understanding how current LLM agents do and do not achieve stable self-representation.

#### Strategic interaction.

CICERO [@meta2022cicero] integrates a language model with piKL planning for Diplomacy, jointly optimizing game actions and dialogue while modeling second-order beliefs, achieving more than $2\times$ the average human score. Deal or No Deal [@lewis2017dealornodeal] pioneered dialogue rollouts for forward simulation of negotiation dynamics. Werewolf and Avalon games serve as concentrated testbeds for deception, trust, and belief manipulation [@xu2024werewolf; @light2023avalonbench], revealing that deceivers consistently prevail by exploiting cognitive limitations.

#### Sandbox simulation.

Generative Agents demonstrated emergent social dynamics: a 25-agent simulation [@park2023generative] used memory-based state tracking and periodic reflection, while Sotopia [@zhou2024sotopia] formalized social simulation evaluation across seven dimensions. Scale has increased dramatically: Project Sid [@altera2024projectsid] deployed 1,000 agents exhibiting emergent specialization and governance, and OASIS [@yang2024oasis] scaled to one million agents reproducing information spreading and group polarization. At the individual level, @argyle2023silicon demonstrate "silicon sampling", which conditions LLMs on specific demographic profiles to simulate survey responses from targeted subpopulations and shows strong alignment with American National Election Studies data, opening a path toward individual-level social world modeling. Generative Social Choice [@fish2023generativesocialchoice] extends this to democratic aggregation, using LLMs to generate representative statements from diverse synthetic participants and enabling deliberation.

#### Challenges and design principles.

Social simulation remains immature: LLMs degrade sharply beyond second-order belief reasoning [@wu2023hitom], agents suffer from role drift and goal forgetting [@park2023generative; @zhou2024sotopia], and formal commitment tracking [@telang2023commitments] remains unintegrated into any LLM architecture. A practical design pattern separates a compact social state representation (commitments, constraints, relations), a dialogue generator, and a transition updater that enforces consistency and makes state transitions loggable and replayable. Flexible persona generation is essential for populating social simulators with diverse, controllable agents; PersonaGym [@samuel2024personagym] provides a benchmark for evaluating how faithfully LLMs enact specified personas across complex social tasks, revealing systematic failures in maintaining persona consistency under adversarial probing. For personalization at the individual level, LaMP [@salemi2023lamp] introduces a benchmark of seven tasks requiring LLMs to generate outputs consistent with a specific user's history, and shows that retrieval-augmented approaches significantly close the gap. Extended details on ToM prompting, sandbox architectures, emergent phenomena, digital twins, and institutional approaches appear in Appendix `\ref{app:l2_extended}`{=latex}.

### Laws of the Scientific World {#subsec:l2_science}

In AI for Science, the transition from L1 to L2 shifts the focus from modeling local states or structures to simulating dynamics over multiple steps. These dynamics arise along two axes. The first concerns the temporal evolution of a system, where the model predicts how a natural system unfolds over time under given conditions or interventions. The second concerns the scientific research process itself, where the model simulates sequences of hypotheses, experiments, and outcomes to support reasoning and action. The two axes yield two forms of simulation in scientific world models: forward simulation of system dynamics, and decision simulation based on surrogate evaluation of candidate experiments.

#### Forward simulation.

World models approximate the evolution of scientific systems by replacing expensive numerical solvers with learned transition operators. GNS [@sanchez2020gns] showed that message passing on particle graphs can simulate fluids, rigid bodies, and deformable materials with generalizable dynamics. The Fourier Neural Operator [@li2021fno] established resolution-invariant operator learning via spectral convolutions, achieving up to 1000$\times$ speedup over traditional solvers and underpinning subsequent weather and fluid surrogates. At planetary scale, Pangu-Weather [@bi2023panguweather] and GraphCast [@lam2023graphcast] match or exceed the ECMWF operational deterministic system, with GraphCast outperforming it on 90% of verification targets. GenCast [@price2024gencast] extends these to probabilistic forecasting via a diffusion architecture, outperforming the ECMWF ensemble system on 97.2% of targets. NeuralGCM [@kochkov2024neuralgcm] integrates learned parameterizations within a differentiable general circulation model, producing emergent phenomena such as tropical cyclones and illustrating the value of coupling mechanistic structure with learned components. Aurora [@bodnar2025aurora] further scales this paradigm to a foundation model of the Earth system, achieving strong performance across multiple forecasting tasks at substantially reduced computational cost. In molecular science, neural network potentials pioneered by @behler2007nnpotentials enabled orders-of-magnitude speedup over density functional theory for molecular dynamics, establishing the foundation for subsequent ML force fields.

#### Decision simulation.

World models reduce the cost of scientific discovery by simulating the experimental decision loop *in silico*. Representative systems span molecular design (ChemBO; @pmlr-v108-korovina20a), biological sequence optimization with population-based model ensembles and meta-level search reallocation (P3BO; @angermueller2020population), and materials discovery guided by user-defined algorithmic objectives (BAX; @chitturi2024targeted). Across these systems, the model simulates not only individual outcomes but the sequential process of experiment selection, maintaining and updating beliefs over candidates while identifying inconsistencies during optimization. However, these capabilities remain confined to a fixed data regime: the model cannot actively design and execute experiments to acquire *new information* that challenges its current assumptions. As a result, while such systems can correct optimization errors, they cannot resolve uncertainty arising from incomplete knowledge, leading to accumulated bias over long horizons. L3 world models (Section `\ref{sec:l3}`{=latex}) overcome this by actively gathering evidence to revise the model.
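The sequential selection process described above can be sketched as a small surrogate-guided campaign over a fixed candidate pool. This is a generic Gaussian-process plus upper-confidence-bound loop written for illustration; it is not the algorithm of ChemBO, P3BO, or BAX. The kernel settings, the pool, and the `hidden_objective` oracle are assumptions standing in for a real assay.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and variance of a zero-mean, unit-variance GP."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_query, x_obs)
    mean = K_s @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.clip(var, 0.0, None)

def run_campaign(candidates, oracle, n_rounds=10, beta=2.0):
    """Repeatedly pick the untested candidate with the highest UCB score."""
    tested = [0]                                  # arbitrary first experiment
    outcomes = [oracle(candidates[0])]
    for _ in range(n_rounds):
        mean, var = gp_posterior(candidates[tested], np.array(outcomes), candidates)
        score = mean + beta * np.sqrt(var)        # belief plus exploration bonus
        score[tested] = -np.inf                   # never repeat an experiment
        nxt = int(np.argmax(score))
        tested.append(nxt)
        outcomes.append(oracle(candidates[nxt]))
    return tested, outcomes

candidates = np.linspace(0.0, 1.0, 51)            # fixed pool: the "data regime"
hidden_objective = lambda x: -(x - 0.7) ** 2      # hypothetical ground truth
tested, outcomes = run_campaign(candidates, hidden_objective)
```

The loop updates beliefs over candidates after every outcome, but the pool and the model class never change, which is the sense in which such systems remain within a fixed information regime.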

### Cross-Domain Analysis {#subsec:l2_crossdomain}

![**Diagnostic map of the four governing-law regimes.** The axes are schematic rather than metric: the horizontal axis reflects how formally specifiable and mechanically verifiable the transition rules are, while the vertical axis reflects how directly the relevant state and constraints are observable. The purpose of the figure is comparative rather than classificatory: it highlights why different regimes demand different forms of rollout validation even when all are instances of L2 simulation. Real systems are often mixed-regime and may sit between regions rather than inside a single box.](figures/fig8_0423.png){#fig:crossdomain_regime_map width="\\textwidth"}

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{0.9mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.3}
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{l|llll}
\toprule
\textbf{Domain} & \textbf{Governing Laws} & \textbf{State Type} & \textbf{Common Failures} & \textbf{Evaluation Focus} \\
\midrule
Physical & Geometry, kinematics & Continuous & Contact instability, drift & Stability; failure clustering \\
Social & Beliefs, norms, ToM & Goals, relations & Role drift, goal forgetting & Counterfactual sensitivity \\
Digital & API contracts, UIs & DOM, permissions & Grounding breaks, races & Error-branch coverage \\
Scientific & Mechanisms, evidence & Hypotheses & Hallucinated mechanisms & Evidence-chain repair \\
\bottomrule
\end{tabular}%
}
```
Figure `\ref{fig:crossdomain_regime_map}`{=latex} positions the four regimes along two diagnostic axes: formalizability and observability of the governing constraints. Across all four regimes, a recurring pattern emerges: a good Simulator does not have to look more like the world; it must look more like the constraints. Physics uses geometry/contact constraints; software uses state machines and structured feedback channels; social worlds use role/norm consistency; science uses evidence chains and falsifiability. Making constraints explicit (loggable, replayable, regressable) often improves long-horizon stability more than increasing perceptual fidelity. Table `\ref{tab:l2_crossdomain}`{=latex} summarizes the governing laws, state types, common failure modes, and evaluation focus for each regime.

#### Cross-regime systems.

Many real-world deployments do not fall neatly into a single governing-law regime; instead, they require an L2 simulator to maintain coherent rollouts across *multiple* constraint families simultaneously. When regimes interact, a violation in one domain can cascade into another: a physically implausible vehicle maneuver may render a social-intent prediction meaningless, or a software bug may invalidate an otherwise sound experimental plan. Designing and evaluating cross-regime systems therefore demands joint constraint satisfaction rather than per-regime evaluation in isolation.

-   **Autonomous driving:** physical (vehicle dynamics, contact mechanics) + social (pedestrian intent prediction, traffic norm compliance) [@hu2023gaia1; @wang2024drivedreamer].

-   **Minecraft agents (Voyager):** physical (3D navigation, combat dynamics) + digital (crafting recipes, inventory management, game-state logic) [@wang2023voyager].

-   **Diplomacy (CICERO):** social (negotiation, trust modeling, alliance formation) + digital (game-state management, rule enforcement) [@meta2022cicero].

-   **Autonomous laboratories (A-Lab):** scientific (experiment design, hypothesis evaluation) + physical (sample manipulation, instrument constraints) [@szymanski2023alab; @boiko2023autonomous].

Failure Modes {#subsec:l2_failure_modes}
-------------

Across all four domains, five recurring failure modes constrain L2 systems:

1.  **Compounding error.** Small per-step deviations are amplified over time, pushing imagined trajectories into branches increasingly unrelated to reality. The most effective mitigations do not polish one-step predictions; instead, they shorten effective planning windows (decomposing long tasks into verifiable short segments and replanning frequently with real feedback), use multi-timescale structure [@shaj2023multitimescale], and bake evidence-gathering actions into policies.

2.  **State aliasing and drift.** In complex environments, distinct real states can look highly similar (two UI pages, slightly different kitchen layouts, or a one-word change in social tone). When representations collapse these states, agents can take irreversible wrong actions. Effective practices include explicit verification at key nodes, memory and retrieval augmentation, and explicit failure attribution labels [@xie2024osworld; @yang2025macosworld; @nasiriany2024robocasa].

3.  **Controllability failure.** A visually rich model that is weakly action-conditioned is less useful for planning than a rough model that responds to actions. When the model is action-insensitive, comparing \`\`do A vs. do B" becomes meaningless [@wu2024ivideogpt; @liu2024lwm; @brooks2024sora; @deepmind2025genie3].

4.  **Exploitability and simulator escape.** If a simulator or evaluation harness has loopholes, search/planning will exploit them systematically. This is especially common in software worlds and automated evaluation [@xie2024osworld; @yang2025macosworld; @zheng2023mcu].

5.  **Calibration failure under distribution shift.** Environment changes (UI versions, layouts, accents, object properties) often trigger overconfident wrong predictions. In practice, confident-but-wrong predictions should be treated as a strong signal that the model needs to evolve [@xie2024osworld; @yang2025macosworld; @nasiriany2024robocasa].

These failures are not merely model shortcomings; they are system-level pathologies produced by the interaction between representation, rollout horizon, control procedure, and evidence quality. The takeaway is that improving average-case predictions is insufficient unless systems can (i) localize failures via evidence and (ii) change behavior under shift and exploit pressure.
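The compounding-error item above, and the mitigation of replanning on short verifiable segments, can be illustrated with a toy scalar system. The sketch assumes a true multiplicative dynamics `a_true`, a model with a 2% per-step error, and a hypothetical `replan_every` parameter that re-anchors the imagined state on the observed one.

```python
import numpy as np

def rollout_error(a_true, a_model, x0, horizon, replan_every=None):
    """Absolute gap between imagined and real state at each step."""
    x_true, x_model, errs = x0, x0, []
    for t in range(horizon):
        if replan_every and t % replan_every == 0:
            x_model = x_true          # re-anchor on real feedback
        x_true = a_true * x_true      # environment step
        x_model = a_model * x_model   # imagined step with a biased model
        errs.append(abs(x_model - x_true))
    return np.array(errs)

open_loop = rollout_error(1.0, 1.02, 1.0, horizon=50)
segmented = rollout_error(1.0, 1.02, 1.0, horizon=50, replan_every=5)
```

With the same one-step error, the open-loop gap grows geometrically with the horizon, while the segmented rollout never exceeds the error accumulated within one five-step segment.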

This constraints-first lens is also a useful guide for choosing what to log and what to regression-test. If a core constraint is violated (e.g., an impossible action succeeds, or a structured feedback channel disappears), the agent will learn the wrong lessons. Conversely, when constraints are explicit and stable, even simple agents can improve reliably via Evolver-style asset distillation [@xie2024osworld; @yang2025macosworld; @nasiriany2024robocasa; @zheng2023mcu; @ghugare2025builderbench].

L3 Evolver: Evidence-Driven Model Revision {#sec:l3}
==========================================

Scientific discovery is naturally organized as a loop: a researcher *designs* experiments, *executes* them, *observes* outcomes, and *reflects* to guide the next step. Recent systems realize one or more components of this process, but many operate without a fully autonomous loop. Decision simulators (Section `\ref{subsec:l2_science}`{=latex}) operate within the *design* step: they simulate experimental outcomes and update beliefs, but all updates occur within a fixed information regime without active data collection. At the other end, RL-based self-reflection systems such as VL-Rethinker [@wang2025vlrethinker] realize the *reflect* step through explicit verify-and-rethink behavior, but do not maintain a persistent world-model stack. Many automated research pipelines [@yang2024moose; @li2024ldc] chain *design* and *reflect* (hypothesis generation and literature synthesis) but lack the *execute* and *observe* components that would close the full loop.

The key distinction of L3 lies in how new information is acquired and used. Rather than passively fitting incoming data or exploiting a fixed model for planning, an L3 system actively designs interventions to reduce uncertainty in its own world model. In the unified view of Figure `\ref{fig:l1l2l3_graphical}`{=latex}, this revision is the vertical *reflect* arrow that connects the top block (model $\mathcal{M}_t$ operating over environment $\mathcal{X}$) to the bottom block (revised model $\mathcal{M}_{t+1}$ operating over an effective environment $\mathcal{X}'$): each iteration of the loop modifies the latent graph that L2 rolls out on. It does so by targeting discrepancies between prediction and observation and using them to refine parameters, extend model structure, or revise underlying assumptions. In this sense, L3 is driven not by reward maximization alone, but by the systematic reduction of model uncertainty through evidence accumulated across many iterations of design, execution, observation, and reflection.

The meta-learning paradigm, notably \`\`learning to learn" [@andrychowicz2016learning], which casts optimizer design itself as a learning problem, foreshadows this self-revision capability: a system improves its own learning procedure rather than merely fitting data. L3 extends this principle from parameter optimization to world-model revision. The result is a paradigm for the next stage of world modeling: systems that not only simulate but also evolve by themselves, acquiring continual learning and self-revision capabilities. In an agent context, this means that the agent's world model is no longer a static artifact consumed at inference time; instead, it becomes a curious living component that diagnoses its own failures, designs targeted experiments to resolve ambiguities, and distills the resulting evidence into persistent model updates. Such self-evolving agents represent a qualitative shift from L2, where the model is a fixed tool for planning, to L3, where the model itself is the object of continual improvement driven by its own deployment experience.

Formal Definition {#subsec:l3_definition}
-----------------

L3 extends L2 from simulation within a fixed information regime to **closed-loop, evidence-driven model revision**. In summary, L3 systems are defined by realizing the entire design--execute--observe--reflect loop, in which new evidence is actively acquired to challenge and revise the model across iterations.

Formally, an L3 system maintains and updates a world-modeling stack: $$\mathcal{M}_t
\xrightarrow{\text{design}} a_t
\xrightarrow{\text{execute}} o_t
\xrightarrow{\text{observe}} d_t
\xrightarrow{\text{reflect}}
\mathcal{M}_{t+1}$$ where $\mathcal{M}_t$ denotes the current world-modeling stack at iteration $t$, $a_t$ the designed experiment or action, $o_t$ the raw outcome, and $d_t$ the distilled evidence used to update the stack (Figure `\ref{fig:l3_evolution_loop}`{=latex}).

Importantly, the presence of this loop alone is not sufficient to establish L3 capability. What distinguishes L3 systems is that evidence is translated into *persistent, reusable model updates* that are validated under regression checks, rather than remaining as transient, in-context adjustments. The model itself becomes the object of improvement, not merely a fixed substrate for planning.
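A minimal sketch of one design, execute, observe, reflect iteration with a regression gate is given below. Everything in it is a hypothetical stand-in: the `Stack` holds a single slope parameter in place of a full world-modeling stack, the design rule simply picks the largest probe, and `regression_suite` is a small list of previously validated cases.

```python
from dataclasses import dataclass, field

@dataclass
class Stack:
    """Hypothetical world-modeling stack M_t: one parameter plus stored evidence."""
    slope: float
    evidence: list = field(default_factory=list)

def suite_error(slope, regression_suite):
    """Mean squared error on previously validated (x, y) cases."""
    return sum((slope * x - y) ** 2 for x, y in regression_suite) / len(regression_suite)

def l3_iteration(stack, environment, probes, regression_suite):
    a_t = max(probes, key=abs)   # design: assumes nonzero probes; larger |x| is more informative
    o_t = environment(a_t)       # execute: run the experiment
    d_t = (a_t, o_t)             # observe: distill the outcome into evidence
    candidate = o_t / a_t        # reflect: propose a fix from the new evidence
    # Regression gate: the update persists only if it does not degrade known cases.
    passes_gate = suite_error(candidate, regression_suite) <= suite_error(stack.slope, regression_suite)
    new_slope = candidate if passes_gate else stack.slope
    return Stack(new_slope, stack.evidence + [d_t]), passes_gate

suite = [(1.0, 3.0), (2.0, 6.0)]
m0 = Stack(slope=1.0)
m1, ok1 = l3_iteration(m0, lambda x: 3.0 * x, [0.5, -2.0, 1.5], suite)
# A faulty instrument yields a candidate that the gate rejects.
m2, ok2 = l3_iteration(m1, lambda x: 10.0 * x, [0.5, -2.0, 1.5], suite)
```

The first iteration produces a persistent update because the candidate improves on the regression suite; the second leaves the model unchanged while still recording the evidence, which is the behavior that separates a gated update from a transient in-context adjustment.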

This formulation closely mirrors the structure of scientific practice. A scientific community can be viewed as an L3 system operating over a shared model $\mathcal{M}_t$, consisting of established theories together with known anomalies. Under \`\`normal science" [@kuhn1962structure], the community continuously updates $\mathcal{M}_t \rightarrow \mathcal{M}_{t+1}$ through incremental refinements that preserve the underlying model class. When accumulated anomalies exceed the explanatory capacity of the current model, the same update process produces a more substantial transition, in which the structure of $\mathcal{M}_t$ itself is revised, corresponding to a paradigm shift. In this view, both gradual refinement and paradigm shifts are instances of the same evidence-driven update, differing only in scale.

In practice, the bottleneck of an L3 system is typically not generating candidate fixes but validating them safely. Multimodal critic models [@zhang2025critic] and regression-gated update pipelines [@ren2026aligning; @jimenez2023swebench; @yang2024sweagent] provide practical infrastructure for the observe and reflect stages. We cite these systems not as full L3 exemplars, but as components of an evolver design space once persistent update and validation mechanisms are in place.

![**The L3 evolution loop.** A full cycle proceeds through four stages: design, execute, observe, and reflect, producing a revised world-modeling stack $\mathcal{M}_{t+1}$.](assets/l3_loop.png){#fig:l3_evolution_loop width="85%"}

#### Revision triggers and evolution policy.

The reflect stage is responsible for deciding *when* and *how* the world model should be revised, in particular distinguishing between incremental improvement and structural change. In practice, this decision is driven by two types of signals. An *anomaly* denotes a mismatch between prediction and observation. While many anomalies can be absorbed through local adjustments, persistent anomalies that resist resolution within the current model class reveal an *epistemic gap*, indicating that the underlying representation or hypothesis space is insufficient. Resolving such gaps typically requires structural changes to the model, corresponding to a paradigm shift. In the philosophical framing of Section `\ref{sec:philosophy_l3}`{=latex}, anomalies that can be absorbed correspond to adjustments within Lakatos's \`\`protective belt" (learned parameters), while persistent anomalies that expose epistemic gaps demand changes to the \`\`hard core" (architecture, inductive biases). In operational terms, this induces a hierarchy of responses: small anomalies are handled via online adaptation within an episode, persistent anomalies trigger parameter updates that are distilled into the model, and epistemic gaps require structural modifications such as introducing new modules or expanding the hypothesis space. Duhem--Quine holism (Section `\ref{sec:philosophy_l3}`{=latex}) highlights that this attribution is inherently ambiguous: a mismatch between prediction and observation can often be explained by multiple components of the model, including representation, dynamics, or auxiliary assumptions. As a result, it is non-trivial to determine whether an anomaly can be resolved through local updates within the current model class or reflects a deeper epistemic gap that requires structural change to the underlying representation.
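The hierarchy of responses described above can be written as a small decision rule. The thresholds `tol` and `window`, the counter `failed_param_updates`, and the `Response` labels are illustrative assumptions; in particular, a fixed rule of this kind does not resolve the Duhem--Quine attribution problem, it only makes the escalation policy explicit.

```python
from enum import Enum

class Response(Enum):
    NONE = "no revision"
    ONLINE_ADAPTATION = "adapt within episode"
    PARAMETER_UPDATE = "distill parameter update"
    STRUCTURAL_CHANGE = "revise model class"

def choose_response(residuals, tol, window, failed_param_updates, max_param_updates=2):
    """Map recent prediction residuals to a revision response."""
    recent = residuals[-window:]
    anomalous = [abs(r) > tol for r in recent]
    if not any(anomalous):
        return Response.NONE
    persistent = len(recent) == window and all(anomalous)
    if not persistent:
        return Response.ONLINE_ADAPTATION      # isolated anomaly
    if failed_param_updates >= max_param_updates:
        return Response.STRUCTURAL_CHANGE      # anomaly resists the current model class
    return Response.PARAMETER_UPDATE           # persistent anomaly
```

For example, `choose_response([0.5, 0.6, 0.7], tol=0.1, window=3, failed_param_updates=2)` escalates to a structural change because the anomaly has persisted through two unsuccessful parameter updates.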

This difficulty is further shaped by the choice of representation. Epistemic gaps often involve missing or incorrect inductive biases or invariances in the model. While such gaps can, in principle, be addressed through structural changes in learned models, these changes are typically indirect: modifying architecture or training procedures does not guarantee a predictable change in the underlying invariances the model captures. In contrast, symbolic representations allow these invariances to be expressed and manipulated explicitly, as seen in scientific laws and principles. This suggests that while latent representations provide a flexible substrate for absorbing anomalies through parameter updates, resolving epistemic gaps may require representations that expose and manipulate invariances explicitly, as is standard in symbolic scientific models. We return to this tension between latent and symbolic representations in Section `\ref{sec:trends:representation}`{=latex}.

Distinction from L2 {#subsec:l2_vs_l3}
-------------------

We use three boundary conditions to mark L2 $\rightarrow$ L3, each corresponding to a transition in the design--execute--observe--reflect loop:

1.  **Active information expansion ($\mathcal{M}_t \rightarrow a_t$)**: the system designs experiments that actively probe uncertainty or challenge its current belief, rather than only optimizing within existing knowledge.

2.  **Autonomous execution and observation ($a_t \rightarrow d_t$)**: the system carries out experiments and acquires evidence through interaction, rather than relying on simulated or pre-existing data.

3.  **Belief revision under challenge ($d_t \rightarrow \mathcal{M}_{t+1}$)**: observations are used to reflect on and revise the model, including updating parameters, structure, or assets, enabling correction of prior assumptions.

The boundary conditions above can be unified as a single principle: whether the world model remains fixed or becomes plastic during deployment. This transition from L2 to L3 manifests in three aspects: whether the model can update its parameters and structure after deployment, how it accumulates new capabilities over time, and whether it passively consumes data or actively generates it through experimentation.

#### Fixed vs. adaptive.

An L2 simulator is typically fixed post-training. It can generate arbitrarily many rollouts, but its core transition function $p_\theta(z_t \mid z_{t-1}, a_t)$ does not evolve; it explores the implications of its frozen knowledge. In contrast, an L3 system is adaptive post-deployment: it treats its own parameters or structure as a hypothesis to be updated, i.e., $\mathcal{M}_{t+1} \leftarrow \mathcal{M}_t + \text{Evidence}$.

#### Modes of growth.

L3 growth goes beyond simple data buffering and encompasses three different modes:

-   **Parameter update**: modifying weights via gradient descent or Bayesian updates on new evidence, e.g., online learning, continual RL fine-tuning, and Bayesian model updates.

-   **Architecture update**: dynamically adding new modules, experts, or capacity to handle complexity, for example, expanding the context window or allocating new memory slots.

-   **Hypothesis-space expansion**: extending the model class to represent explanations that were previously inexpressible. This corresponds to introducing new variables, mechanisms, or abstractions, shifting from \`\`I don't know which of these $k$ options is true" to \`\`the correct explanation is not among the current $k$ options." This is the most challenging mode and is closely tied to abduction and genuine scientific discovery.
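The last mode in the list above can be made concrete with a deliberately small example: a finite set of candidate coin biases, and a check for whether even the best candidate explains the data poorly. The gap statistic (average log-likelihood of the empirical frequency minus that of the best hypothesis) and the threshold `gap_tol` are assumptions chosen for illustration. Adding a new parameter value is only a crude proxy for introducing new variables or mechanisms.

```python
import math

def avg_loglik(p, heads, n):
    """Average Bernoulli log-likelihood per observation."""
    p = min(max(p, 1e-9), 1 - 1e-9)   # guard against log(0)
    return (heads * math.log(p) + (n - heads) * math.log(1 - p)) / n

def maybe_expand(hypotheses, heads, n, gap_tol=0.05):
    """Add a hypothesis when none of the current k options fits the evidence."""
    best = max(hypotheses, key=lambda p: avg_loglik(p, heads, n))
    mle = heads / n
    gap = avg_loglik(mle, heads, n) - avg_loglik(best, heads, n)
    if gap > gap_tol:
        return sorted(hypotheses + [mle]), True   # "not among the current k options"
    return list(hypotheses), False                # ordinary uncertainty among the k options

expanded, did_expand = maybe_expand([0.2, 0.5], heads=90, n=100)
unchanged, did_not = maybe_expand([0.2, 0.5], heads=52, n=100)
```

With 90 heads in 100 flips, neither 0.2 nor 0.5 is adequate and the space is expanded; with 52 heads, the existing hypothesis 0.5 suffices and the space is left alone.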

#### Passive vs. active.

While L2 systems may support passive online learning (updating weights on a stream of incoming data) or decision simulation (Section `\ref{subsec:l2_science}`{=latex}), L3 is characterized by an *active trial-and-error loop*. An L3 system does not just wait for data; it acts to generate data that maximizes information gain regarding a specific hypothesis or area of uncertainty. This active stance transforms the agent from a consumer of experience to a designer of experiments, a qualitative shift that connects directly to the philosophy of abduction and scientific method (Section `\ref{sec:philosophy_l3}`{=latex}). L3 should not be defined by closed-loop use in the generic planning sense; rather, it is defined by closing the evidence-to-revision loop, so that deployment outcomes are used to diagnose, update, and validate the world-modeling stack itself over successive iterations of use.
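Selecting the experiment that maximizes information gain can be sketched for a discrete hypothesis set and binary outcomes. The function names and the three named experiments are hypothetical; `likelihoods[h]` is assumed to give the probability of a positive outcome under hypothesis `h`.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def expected_information_gain(prior, likelihoods):
    """Expected reduction in hypothesis entropy from one binary experiment."""
    p1 = sum(w * l for w, l in zip(prior, likelihoods))
    gain = entropy(prior)
    branches = ((p1, likelihoods), (1 - p1, [1 - l for l in likelihoods]))
    for outcome_prob, lik in branches:
        if outcome_prob > 0:
            posterior = [w * l / outcome_prob for w, l in zip(prior, lik)]
            gain -= outcome_prob * entropy(posterior)
    return gain

def design(prior, experiments):
    """Pick the experiment with the largest expected information gain."""
    return max(experiments, key=lambda name: expected_information_gain(prior, experiments[name]))

prior = [0.5, 0.5]
experiments = {
    "uninformative": [0.5, 0.5],   # both hypotheses predict the same outcome
    "weak": [0.6, 0.4],
    "decisive": [1.0, 0.0],        # the hypotheses make opposite predictions
}
chosen = design(prior, experiments)
```

A passive learner would update on whichever outcome arrives; the active designer prefers the experiment on which the competing hypotheses disagree most sharply.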

Examples and Applications {#subsec:l3_examples}
-------------------------

L3 is most tractable in domains that are highly instrumented, offer rapid feedback, and provide well-defined evaluation criteria. Empirical support for L3 is uneven across domains: autonomous science and other highly instrumented settings provide the clearest demonstrations, whereas social, code, and embodied environments remain partly empirical and partly prospective design space. We illustrate this landscape, together with the characteristic evidence signals and failure modes in each, across four governing-law regimes in Figure `\ref{fig:l3_domains}`{=latex}.

```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{figures/fig7_0423_1.pdf}
\caption{\textbf{L3 evolution across four governing-law regimes.} Each panel illustrates the design--execute--observe--reflect loop in a representative domain: (a)~Physical intelligence---adaptive probing revises contact dynamics; (b)~Social intelligence---norm drift triggers social-model revision; (c)~Digital intelligence---evaluator-driven program search with regression gates; (d)~Scientific intelligence---closed-loop autonomous experimentation at a synchrotron beamline.}
\label{fig:l3_domains}
\end{figure*}
```
#### Physical intelligence.

In embodied settings, L3 manifests as adaptive probing to infer and update dynamics models. When a robot encounters unexpected contact dynamics, such as a slippery surface or a deformable object, the system can actively execute diagnostic actions (small perturbations designed to disambiguate between hypotheses about the contact model) and use the resulting evidence to update its dynamics model. The anomaly signals in this regime are inherently physical: force/torque deviations, unexpected contact events, and discrepancies between predicted and observed end-effector trajectories provide quantitative evidence for model updates. Recent work demonstrates that robots can autonomously detect physical damage and re-train persistent self-models: @hu2025selfmodeling show that an egocentric visual self-model detects morphology changes via prediction-versus-observation mismatch and re-trains to recover locomotion. AdaptSim [@ren2023adaptsim] meta-learns an adaptation policy that iteratively revises simulation parameters from small amounts of real-world task performance data, closing the sim-to-real gap through evidence-driven simulation revision rather than fixed domain randomization, with each real-world deployment informing the next round of simulation updates (see Appendix `\ref{app:l3_examples}`{=latex} for a worked physical-intelligence example).

#### Digital intelligence.

Software and web environments are naturally suited to L3 because state is typically inspectable, actions can often be replayed deterministically, and regression testing provides a built-in validation gate. Evaluator-driven discovery loops exemplify this regime. @romera2024funsearch pair a pretrained LLM with an automated evaluator in an evolutionary loop: the LLM generates candidate programs, the evaluator scores them against a formal specification, and high-scoring solutions are fed back for further refinement. This loop discovered new constructions for the cap set problem (a long-standing open problem in combinatorics) and new bin-packing heuristics that outperform known baselines. The evaluator serves as an automated regression gate, a key L3 property, although the system realizes only the design and observe components (program generation and automated scoring) without active information expansion or persistent model revision. @alphaevolve2025 extend this evolutionary coding paradigm: by pairing LLM-generated program mutations with automated correctness evaluators, the system found the first improvement in 56 years over Strassen's algorithm for multiplying $4 \times 4$ complex-valued matrices and improved on the best-known solutions for roughly 20% of the open mathematical problems it was applied to, illustrating the power of automated verification as an L3 gatekeeper in algorithmic domains. CodeIt [@butt2024codeit] closes a tighter loop: the LLM is fine-tuned from its own search trajectories via prioritized hindsight replay, so that the generative model itself (serving as an implicit world model of program space) persistently improves across tasks. The AI Scientist-v2 [@yamada2025aiscientistv2] pushes further into computational experiments by employing agentic tree search for experiment selection: the system autonomously formulates hypotheses, designs and executes experiments, analyzes results, and writes complete manuscripts. A VLM feedback loop iteratively refines figures and content. 
In 2025, this system produced an entirely AI-generated paper that passed peer review at an ICLR workshop. However, the system's experiments are computational (running ML training jobs), and its revision loop operates on paper quality rather than mechanistic understanding, illustrating the gap between L3 in well-instrumented computational domains and the harder challenge of genuine scientific discovery. In AUI [@lin2025computer], a Coder--Computer-Use Agent loop instantiates this principle in website development: the Coder iteratively revises website implementations, while the CUA acts as an automated evaluator by executing task trajectories and verifying functional correctness (e.g., navigation success and task completion). The resulting feedback, grounded in executable interactions rather than static inspection, serves as a regression signal that guides subsequent code updates, forming a closed-loop optimization process aligned with L3 properties.
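The evaluator-driven pattern shared by these systems can be reduced to a toy loop in which a proposer generates candidates and an automated evaluator gates what survives. The sketch below is not the FunSearch or AlphaEvolve implementation: random integer mutation stands in for the LLM proposer, a candidate is a pair of coefficients rather than a program, and the specification is a handful of input/output cases for a hypothetical linear target.

```python
import random

def evaluator(candidate, cases):
    """Score = negative total absolute error on the specification cases."""
    a, b = candidate
    return -sum(abs((a * x + b) - y) for x, y in cases)

def mutate(candidate, rng):
    """Stand-in for a learned proposer: perturb each coefficient by at most 1."""
    a, b = candidate
    return (a + rng.choice([-1, 0, 1]), b + rng.choice([-1, 0, 1]))

def evolve(cases, generations=200, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [(0, 0)]
    history = []
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        ranked = sorted(set(population + children),
                        key=lambda c: evaluator(c, cases), reverse=True)
        population = ranked[:pop_size]            # elitist selection via the evaluator
        history.append(evaluator(population[0], cases))
    return population[0], history

cases = [(x, 3 * x + 2) for x in range(-3, 4)]    # hypothetical specification
best, history = evolve(cases)
```

Because selection is elitist, the best score never regresses across generations, which is the sense in which the evaluator acts as a regression gate. Note that nothing in the proposer itself changes, mirroring the observation above that such loops lack persistent model revision unless the generator is also updated.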

#### Social intelligence.

L3 in social domains requires revising the agent's social model when predicted behavior of other agents deviates from observed behavior, for example, when Theory-of-Mind predictions fail systematically or when social norms drift over time. This is currently the hardest regime for L3 because attribution is inherently ambiguous (a failed social prediction may reflect incorrect beliefs about the other agent's goals, an outdated norm model, or stochastic behavior) and because social experiments are ethically constrained. Early work on norm emergence and convention formation in multi-agent populations (Section `\ref{subsec:l2_social}`{=latex}) represents a preliminary step toward social L3, but persistent, validated revision of social world models from deployment evidence remains largely open. A more recent example is the evolutionary synthesis of multi-agent governance rules: @kumar2026constitutions use LLM-driven genetic programming to evolve interpretable constitutions from societal stability scores, surpassing human-designed rules by 123%.

#### Scientific intelligence.

The most complete current examples of L3 come from autonomous science, where the full design--execute--observe--reflect loop is closed by instrumentation. The paradigm of autonomous closed-loop scientific discovery was established by Robot Scientist Adam [@sparkes2010robotscientist], the first machine to autonomously design experiments about gene function, execute them, observe the outcomes, and revise its model. Its successor system demonstrated closed-loop cycles of experiment design, execution, and model revision in yeast systems biology, accelerating biological model development [@coutant2019yeast]. CAMEO [@kusne2020cameo] implements closed-loop materials discovery via Bayesian active learning at a synchrotron beamline: the system predicts which phase a candidate composition will form, synthesizes it, characterizes the product via X-ray diffraction, updates its Bayesian belief model, and actively selects the next experiment to maximize information gain. Each experimental cycle takes seconds to minutes, and the system discovered a novel phase-change memory material without additional human training. A-Lab [@szymanski2023alab] extends this to fully autonomous synthesis: three robotic arms automate powder dosing, heating, and XRD characterization, with an active-learning algorithm generating improved recipes when targets fail. In 17 days of closed-loop operation, A-Lab performed 353 experiments and realized 36 compounds from 57 targets. Crucially, analysis of failed syntheses provided structured evidence to refine future synthesis strategies; the failures were not discarded but distilled into persistent knowledge. @striethkalthoff2024sdl extend the self-driving laboratory paradigm to distributed, multi-site operation: a delocalized SDL autonomously discovers novel organic laser emitters by iteratively updating a Bayesian surrogate from synthesis and characterization data across geographically separated facilities. 
BacterAI [@dama2023bacterai] demonstrates that L3 can operate with zero prior biological knowledge: the system iteratively designs and executes experiments to map microbial amino acid requirements, revising its metabolic model purely from experimental evidence. In computational chemistry, MOOSE-Chem [@yang2025moosechem] demonstrates that an LLM-based framework can rediscover chemistry hypotheses published in Nature and Science in 2024 using only pre-2024 literature, providing evidence that the hypothesis-generation component of the L3 loop is already feasible for natural-science domains. Its successor, MOOSE-Chem2 [@yang2025moosechem2], introduces hierarchical search over fine-grained hypothesis components to improve both precision and novelty of generated discoveries. Appendix `\ref{app:l3_examples}`{=latex} presents worked examples spanning all four regimes. Broader agentic systems are pushing the L3 loop further into biomedicine. Biomni [@huang2025biomni] provides a general-purpose biomedical AI agent that integrates over 100 tools and 59 databases spanning 25 subfields, enabling autonomous execution of tasks from causal gene prioritization to drug repurposing. BioLab [@jin2025biolab] extends this to end-to-end autonomous life-sciences research via a multi-agent system built on biological foundation models. OriGene [@zhang2025origene] demonstrates a self-evolving virtual disease biologist that autonomously discovers therapeutic targets through iterative hypothesis refinement. The AI co-scientist system [@gottweis2025coscientist] employs a generate--debate--evolve approach to hypothesis generation, with multi-agent tournament processes that have been validated in drug repurposing and epigenetic target discovery. 
Complementing these systems, @yang2026llmknowledge introduce a dynamic benchmark revealing that current LLMs still fall short on genuine biological knowledge derivation, underscoring the persistent gap between literature retrieval and true L3 revision that actually updates the underlying model.

```{=latex}
\begin{table*}[!t]\caption{\textbf{Representative L3 systems by governing-law regime.} Loop steps indicate which stages of the design, execute, observe, and reflect cycle each system realizes.}
\label{tab:l3_systems}

\small
\setlength{\tabcolsep}{0.9mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l|cc|cccc}
\toprule
\textbf{System} & \multicolumn{2}{c|}{\textbf{Links}} & \textbf{Design} & \textbf{Execute} & \textbf{Observe} & \textbf{Reflect} \\
\midrule

\rowcolor{gray!15}
\textit{Physical World} \\
\shortstack[l]{AdaptSim~\citep{ren2023adaptsim}}
& \paperlink{https://arxiv.org/abs/2302.04903}
& \githublink{https://github.com/irom-princeton/AdaptSim}
& \cmark & \cmark & \cmark & \xmark \\
\shortstack[l]{Self-Modeling~\citep{hu2025selfmodeling}}
& \paperlink{https://arxiv.org/abs/2207.03386}
& \githublink{https://github.com/H-Y-H-Y-H/Egocentric_VSM}
& \cmark & \cmark & \cmark & \cmark \\
\midrule

\rowcolor{gray!15}
\textit{Digital World} \\
\shortstack[l]{FunSearch~\citep{romera2024funsearch}}
& \paperlink{https://doi.org/10.1038/s41586-023-06924-6}
& \githublink{https://github.com/google-deepmind/funsearch}
& \cmark & \cmark & \cmark & \xmark \\
\shortstack[l]{CodeIt~\citep{butt2024codeit}}
& \paperlink{https://arxiv.org/abs/2402.04858}
& \githublink{https://github.com/Qualcomm-AI-research/codeit}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{SWE-agent~\citep{yang2024sweagent}}
& \paperlink{https://arxiv.org/abs/2405.15793}
& \githublink{https://github.com/princeton-nlp/SWE-agent}
& \cmark & \cmark & \cmark & \xmark \\
\shortstack[l]{AUI~\citep{lin2025computer}}
& \paperlink{https://arxiv.org/abs/2511.15567}
& \githublink{https://github.com/showlab/AUI}
& \cmark & \cmark & \cmark & \xmark \\
\shortstack[l]{AlphaEvolve~\citep{alphaevolve2025}}
& \paperlink{https://arxiv.org/abs/2506.13131}
& \githublink{https://github.com/google-deepmind/alphaevolve_results}
& \cmark & \cmark & \cmark & \xmark \\
\midrule

\rowcolor{gray!15}
\textit{Social World} \\
\shortstack[l]{Evolving Const.~\citep{kumar2026constitutions}}
& \paperlink{https://arxiv.org/abs/2602.00755}
& {---}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{AgentSociety~\citep{piao2025agentsociety}}
& \paperlink{https://arxiv.org/abs/2502.08691}
& \githublink{https://github.com/tsinghua-fib-lab/AgentSociety}
& \cmark & \cmark & \cmark & \xmark \\
\midrule

\rowcolor{gray!15}
\textit{Scientific World} \\
\shortstack[l]{Robot Scientist~\citep{sparkes2010robotscientist}}
& \paperlink{https://doi.org/10.1186/1759-4499-2-1}
& {---}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{CAMEO~\citep{kusne2020cameo}}
& \paperlink{https://arxiv.org/abs/2006.06141}
& \githublink{https://github.com/KusneNIST/CAMEO_NComm}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{Yeast Cycles~\citep{coutant2019yeast}}
& \paperlink{https://doi.org/10.1073/pnas.1900548116}
& {---}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{BacterAI~\citep{dama2023bacterai}}
& \paperlink{https://doi.org/10.1038/s41564-023-01376-0}
& \githublink{https://github.com/jensenlab/BacterAI}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{A-Lab~\citep{szymanski2023alab}}
& \paperlink{https://doi.org/10.1038/s41586-023-06734-w}
& {---}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{SDL Lasers~\citep{striethkalthoff2024sdl}}
& \paperlink{https://doi.org/10.1126/science.adk9227}
& \githublink{https://github.com/aspuru-guzik-group/acdc_laser}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{AI Scientist~\citep{lu2024aiscientist}}
& \paperlink{https://arxiv.org/abs/2408.06292}
& \githublink{https://github.com/SakanaAI/AI-Scientist}
& \cmark & \cmark & \cmark & \cmark \\
% \shortstack[l]{MOOSE-Chem~\citep{yang2025moosechem}}
% & \paperlink{https://arxiv.org/abs/2410.07076}
% & \githublink{https://github.com/ZonglinY/MOOSE-Chem}
% & \cmark & \xmark & \cmark & \xmark \\
% \shortstack[l]{MOOSE-Chem2~\citep{yang2025moosechem2}}
% & \paperlink{https://arxiv.org/abs/2505.19209}
% & \githublink{https://github.com/ZonglinY/MOOSE-Chem2}
% & \cmark & \xmark & \cmark & \xmark \\
\shortstack[l]{Biomni~\citep{huang2025biomni}}
& \paperlink{https://doi.org/10.1101/2025.05.30.656746}
& \githublink{https://github.com/snap-stanford/Biomni}
& \cmark & \cmark & \cmark & \xmark \\
\shortstack[l]{BioLab~\citep{jin2025biolab}}
& \paperlink{https://doi.org/10.1101/2025.09.03.674085}
& {---}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{OriGene~\citep{zhang2025origene}}
& \paperlink{https://doi.org/10.1101/2025.06.03.657658}
& \githublink{https://github.com/GENTEL-lab/OriGene}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{Co-Scientist~\citep{gottweis2025coscientist}}
& \paperlink{https://arxiv.org/abs/2502.18864}
& {---}
& \cmark & \cmark & \cmark & \cmark \\
\shortstack[l]{AI Scientist v2~\citep{yamada2025aiscientistv2}}
& \paperlink{https://arxiv.org/abs/2504.08066}
& \githublink{https://github.com/SakanaAI/AI-Scientist-v2}
& \cmark & \cmark & \cmark & \cmark \\
\bottomrule
\end{tabular}
\end{table*}
```
Table `\ref{tab:l3_systems}`{=latex} summarizes representative L3 systems across the four governing-law regimes, indicating which stages of the design--execute--observe--reflect loop each system realizes.

#### Evidence quality and falsifiability.

The quality of evolution depends on the quality of evidence. Table `\ref{tab:l3_evidence}`{=latex} organizes the revision signals that trigger L3 model updates in each governing-law regime: what the agent detects, why it indicates the current model is wrong, and how falsifiable the signal is.

::: {#tab:l3_evidence}
+------------------------------------------------------------------------------------+------------------------------------------------------+
| **Revision Signal**                                                                | **Trigger Condition**                                |
+:===================================================================================+:=====================================================+
| `\rowcolor{gray!15}`{=latex} Physical World                                        |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| ```{=latex}                                                                        | Trajectory violates joint limits or collision bounds |
| \addlinespace                                                                      |                                                      |
| ```                                                                                |                                                      |
| `\rowcolor{green!20}`{=latex} Kinematic infeasibility [@ren2023adaptsim]           |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{yellow!30}`{=latex} Contact dynamics mismatch [@hu2025selfmodeling]     | Force/torque deviates from predicted contact model   |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{red!25}`{=latex} Morphology change [@hu2025selfmodeling]                | Visual self-model diverges from observed body state  |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{gray!15}`{=latex} Social World                                          |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| ```{=latex}                                                                        | ... shift                                            |
| \addlinespace                                                                      |                                                      |
| ```                                                                                |                                                      |
| `\rowcolor{green!20}`{=latex} Interventional inconsistency [@piao2025agentsociety] |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{yellow!30}`{=latex} Global behavioral drift [@kumar2026constitutions]   | ... response                                         |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{red!25}`{=latex} [@taubenfeld2024biases]                                | Agent behavior deviates from demographic priors      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{gray!15}`{=latex} Digital World                                         |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| ```{=latex}                                                                        | Previously passing test fails post-update            |
| \addlinespace                                                                      |                                                      |
| ```                                                                                |                                                      |
| `\rowcolor{green!20}`{=latex} Regression detection [@romera2024funsearch]          |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{yellow!30}`{=latex} Execution outcome mismatch [@lin2025computer]       | Predicted state differs from actual execution result |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{red!25}`{=latex} Task completion failure [@butt2024codeit]              | Action sequence fails to achieve specified goal      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{gray!15}`{=latex} Scientific World                                      |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| ```{=latex}                                                                        | Experiment contradicts predicted outcome             |
| \addlinespace                                                                      |                                                      |
| ```                                                                                |                                                      |
| `\rowcolor{green!20}`{=latex} Hypothesis falsification [@szymanski2023alab]        |                                                      |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{yellow!30}`{=latex} Prediction--measurement gap [@kusne2020cameo]       | Surrogate output diverges from measurement           |
+------------------------------------------------------------------------------------+------------------------------------------------------+
| `\rowcolor{red!25}`{=latex} Epistemic gap detection [@dama2023bacterai]            | ... scope                                            |
+------------------------------------------------------------------------------------+------------------------------------------------------+

: **Revision signals for L3 evolution by governing-law regime.** Row color encodes *within-domain* falsifiability (not comparable across regimes): [High]{style="background-color: green!20"}, [Medium]{style="background-color: yellow!30"}, [Low]{style="background-color: red!25"}.
:::

A useful principle is to prefer falsifiable evidence (Section `\ref{sec:philosophy_l3}`{=latex}). A screenshot combined with a DOM snapshot, error code, and action sequence is reproducible and refutable; "I think the page didn't load" is not. Human feedback should not be treated as a single falsifiability class: subjective or preference feedback is weakly falsifiable, whereas expert diagnostic feedback can be strongly falsifiable when its claims are subsequently checked by tests, experiments, or structured evaluation. An Evolver's progress depends on making lessons verifiable, and reversible when wrong. This requirement connects directly to the anomaly and epistemic-gap triggers defined in Section `\ref{subsec:l3_definition}`{=latex}: an anomaly is actionable only when the deviation between prediction and observation can be quantified from recorded evidence, and an epistemic gap is recognizable only when the system can demonstrate that no existing hypothesis adequately accounts for the observation.

In large-scale deployments, evidence must also be compressible and indexable. Practical systems maintain multi-resolution evidence: a compact error category combined with a state fingerprint and diff summary for fast retrieval, together with pointers to heavier artifacts (screenshots, DOM snapshots, full logs) for deep audits. Evidence quality is also tightly coupled to privacy and safety constraints: an Evolver pipeline must separate what is stored persistently (sanitized logs, hashed fingerprints) from what is kept transient or behind access controls, protecting sensitive data while retaining an audit trail [@xie2024osworld; @yang2025macosworld].
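The multi-resolution evidence layout described above can be sketched as a small record type. This is a minimal illustration under stated assumptions, not a description of any cited system: the names `EvidenceRecord`, `state_fingerprint`, `make_record`, and `index_by_fingerprint`, the 16-character truncated SHA-256 fingerprint, and the artifact path in the usage note are all hypothetical choices made for the example.

```python
import hashlib
import json
from dataclasses import dataclass


def state_fingerprint(state: dict) -> str:
    """Hash a canonical JSON encoding so the raw state need not be persisted."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical compact evidence entry for fast retrieval."""
    error_category: str            # compact label, e.g. "element_not_found"
    fingerprint: str               # hashed state, safe to store and index
    diff_summary: str              # short predicted-vs-observed description
    artifact_pointers: tuple = ()  # references to heavy, access-controlled artifacts


def make_record(category, state, predicted, observed, artifacts=()):
    """Build a record; only the hash of the state is kept, not the state itself."""
    diff = f"predicted={predicted!r} observed={observed!r}"
    return EvidenceRecord(category, state_fingerprint(state), diff, tuple(artifacts))


def index_by_fingerprint(records):
    """Group records by (category, fingerprint) so recurring failures are easy to find."""
    index = {}
    for record in records:
        index.setdefault((record.error_category, record.fingerprint), []).append(record)
    return index
```

For example, `make_record("element_not_found", {"url": "a", "dom": 1}, "dialog", "none", ["artifacts/run-1/screenshot.png"])` yields a record whose fingerprint is independent of key order in the state dictionary, so two observations of the same state land in the same index bucket while the screenshot itself stays behind a pointer.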

Continuous self-improvement also introduces governance challenges, including benchmark overfitting, knowledge contamination, and misattribution of failures to the wrong components. These risks and the practical measures to mitigate them (versioning, rollback, regression gates) are discussed as open problems in Section `\ref{sec:trends}`{=latex}.

L3 in Context: Maturity, Governance, and Outlook {#subsec:l3_context}
------------------------------------------------

Having established the L3 evolution loop, its domain instantiations, and the role of evidence quality, we now examine its practical status and implications. This subsection addresses two complementary questions: *maturity*, i.e., where L3 systems have been successfully realized across governing-law regimes; and *governance*, i.e., what risks arise from persistent, automated model revision. Together, these perspectives characterize L3 both as a modeling paradigm and as a deployed system that must evolve reliably under real-world constraints.

#### Maturity across different domains.

We summarize maturity across the four governing-law regimes:

1.  **Scientific (Established).** The most mature regime, offering fast, structured feedback, unambiguous anomaly signals (hypothesis falsification), and well-defined revision targets (surrogate model parameters, synthesis recipes) [@kusne2020cameo; @szymanski2023alab; @sparkes2010robotscientist; @dama2023bacterai]. Primary bottleneck: instrument access and real-data budget.

2.  **Digital (Partial).** Regression testing provides an automated validation gate, but many systems still lack the active information-expansion boundary condition [@romera2024funsearch; @alphaevolve2025; @butt2024codeit]. Primary bottleneck: active experiment design is often absent.

3.  **Physical (Emerging).** Promising but limited by attribution difficulty: a failed manipulation can stem from perception, dynamics, actuation, or environmental change, and isolating the brittle component requires careful experimental design [@ren2023adaptsim; @hu2025selfmodeling]. Primary bottleneck: failure attribution across perception, dynamics, and actuation.

4.  **Social (Aspirational).** Social experiments are ethically constrained, attribution is inherently ambiguous, and behavioral ground truth is noisy [@kumar2026constitutions]. Primary bottleneck: attribution ambiguity and ethical constraints on social experimentation.

#### Governance challenges.

Three governance risks arise specifically from persistent, automated model revision. *Benchmark overfitting* occurs when the regression gate is too close to the training distribution; the system learns to pass tests rather than improve genuinely. *Knowledge contamination* occurs when the revision loop incorporates evidence that is itself biased or adversarially constructed, silently degrading the model on OOD inputs. *Misattribution cascades* occur when a failure is attributed to the wrong component, so that the resulting fix inadvertently degrades another component; without comprehensive regression suites, the net effect of an update can be negative. Mitigations include held-out probe sets that are refreshed independently of training data, canary deployment that surfaces regressions before full rollout, and causal ablations that isolate the contribution of each update.

#### Relationship to Sections `\ref{sec:evaluation}`{=latex} and `\ref{sec:implementation}`{=latex}.

From an evaluation perspective (Section `\ref{sec:evaluation}`{=latex}), assessing L3 requires protocols that go beyond single-episode accuracy: the key metric is whether the system improves across revision cycles $k$ without regressing on held-out probes. From an implementation perspective (Section `\ref{sec:implementation}`{=latex}), L3 places the heaviest demands on the system stack: persistent storage, replay infrastructure, regression harnesses, and rollback mechanisms are all required, yet they are often underspecified in current architectures. Building toward L3 therefore means investing in evaluation infrastructure as much as in model capacity.
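The criterion of improving across revision cycles without regressing on held-out probes can be illustrated with a simple acceptance gate. The sketch below is an assumption-laden toy: the function names, the scalar task and probe scores, and the `tolerance` parameter are hypothetical, and a real pipeline would compare full evaluation suites rather than two numbers.

```python
def accept_revision(task_before, task_after, probe_before, probe_after, tolerance=0.0):
    """Accept a revision only if the task score improves and the
    held-out probe score does not drop by more than `tolerance`."""
    improves = task_after > task_before
    no_regression = probe_after >= probe_before - tolerance
    return improves and no_regression


def run_cycles(candidates, task_score, probe_score, tolerance=0.0):
    """Apply the gate over revision cycles k = 1..K.

    `candidates` is a list of (task_score, probe_score) pairs, one per
    proposed revision. Rejected revisions are rolled back, i.e. the
    previous scores are carried forward unchanged.
    """
    history = [(task_score, probe_score)]
    for task_k, probe_k in candidates:
        current_task, current_probe = history[-1]
        if accept_revision(current_task, task_k, current_probe, probe_k, tolerance):
            history.append((task_k, probe_k))
        else:
            history.append((current_task, current_probe))  # rollback
    return history
```

In this toy setting, a revision that raises the task score from 0.6 to 0.7 but drops the probe score from 0.8 to 0.5 is rejected, which is the behavior a regression gate is meant to enforce.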

Evaluations {#sec:evaluation}
===========

Evaluating world models for agentic AI requires moving beyond standard generative metrics toward decision-centric protocols organized around three boundary conditions: long-horizon coherence, intervention sensitivity, and constraint consistency. This section first motivates this shift (Section `\ref{subsec:eval_decision}`{=latex}), then maps the benchmark landscape by governing-law regime and provides detailed evaluation protocols for each condition (Section `\ref{subsec:eval_boundary}`{=latex}), and finally shows how the *same* benchmark can test L1, L2, or L3 depending on the evaluation protocol (Section `\ref{subsec:eval_levels}`{=latex}). World-model-specific evaluations show that even frontier models still suffer from substantial capability gaps, while no single benchmark fully captures the space of interest. For further clarification, Appendix `\ref{app:eval_extended}`{=latex} provides a capability coverage matrix and a Minimal Reproducible Evaluation Package (MREP).

From prediction-centric to decision-centric evaluation {#subsec:eval_decision}
------------------------------------------------------

While standard generative metrics such as Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), SSIM, and per-pixel reconstruction loss capture perceptual quality, they are at best weak indicators of agentic capability [@brooks2024sora; @deepmind2025genie3] and offer limited predictive power for the downstream decisions an agent must ultimately make once embedded in a real-world environment.

As a result, a world model can generate visually convincing rollouts while still breaking down during planning because of hallucinated object dynamics, action-insensitive transitions, or subtle physics violations. These errors are often invisible to distribution-level metrics but devastating for downstream decision-making.

The root cause is a mismatch between what is measured and what matters. The object of evaluation should not be a single-step prediction $p_\theta(z_t\mid z_{t-1},a_t)$ in isolation, but the **trajectory-level rollout** $$\hat p(\tau \mid z_0, a_{1:H}, c), \qquad \tau=(z_1,\ldots,z_H),$$ and specifically whether this rollout is reliable enough for a planner to act on (Section `\ref{sec:l2}`{=latex}). Aggregate measures such as mean success rates further obscure the picture by masking high variance across task instances [@agarwal2021statistical; @henderson2018deep].

We therefore organize evaluation around the three **boundary conditions** that mark L1$\to$L2 (Section `\ref{subsec:l2_requirements}`{=latex}):

1.  **Long-horizon coherence:** rollouts remain decision-usable over $H$ steps rather than degrading through compounding error.

2.  **Intervention sensitivity:** counterfactual edits (action or premise changes) induce stable and directionally meaningful trajectory changes.

3.  **Constraint consistency:** generated futures respect the governing laws of the target regime (Section `\ref{sec:l2}`{=latex}).

These three conditions hold across the four governing-law regimes introduced in Section `\ref{sec:l2}`{=latex}, and together they give a common framework within which we can organize the evaluation protocols, benchmark analyses, and reporting standards described in the remainder of this section.
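As a concrete illustration of the first condition, long-horizon coherence can be probed with a degradation curve: the mean rollout error at each step, together with the number of steps for which that error stays below a task-specific threshold. The sketch below is a minimal example, assuming paired predicted and ground-truth rollouts of equal length; `degradation_curve`, `usable_horizon`, the distance function, and the threshold are hypothetical names and choices, not a protocol taken from a specific benchmark.

```python
def degradation_curve(pred_rollouts, true_rollouts, dist):
    """Mean distance between predicted and true states at each step t.

    Assumes every rollout has the same horizon and that the two lists
    are aligned episode by episode.
    """
    horizon = len(pred_rollouts[0])
    n_episodes = len(pred_rollouts)
    return [
        sum(dist(p[t], q[t]) for p, q in zip(pred_rollouts, true_rollouts)) / n_episodes
        for t in range(horizon)
    ]


def usable_horizon(curve, threshold):
    """Number of leading steps whose mean error stays at or below `threshold`."""
    steps = 0
    for error in curve:
        if error > threshold:
            break
        steps += 1
    return steps
```

With scalar states and `dist = lambda a, b: abs(a - b)`, predicted rollouts `[[1.0, 2.0, 4.0], [1.0, 2.5, 5.0]]` against true rollouts `[[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]` give the curve `[0.0, 0.25, 1.5]`, so the usable horizon at threshold 0.5 is two steps.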

World-model evaluation is ultimately meaningful only insofar as it reflects downstream decision quality. Benchmarks that capture long-horizon coherence, intervention sensitivity, or constraint consistency are valuable not merely as diagnostics, but because these properties should translate into better plan selection, fewer costly invalid actions, and greater task success under distribution shift. The relevant bridge is therefore not "Does the model look realistic?" but "Does improved model validity change what the agent chooses, and does that shift in choice in turn improve real-world task outcomes?"

Two aggregate metrics operationalize these conditions for downstream decision-making. The **Action Success Rate** (ASR) measures how often a planner that uses the world model's rollouts to select actions achieves the task goal in the real environment: $$\mathrm{ASR} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\text{task}_i \text{ succeeds under policy derived from } \hat{p}\bigr].$$ The **Counterfactual Outcome Deviation** (COD) measures intervention sensitivity by comparing rollout outcomes under two action sequences $a^{(1)}_{1:H}$ and $a^{(2)}_{1:H}$ that differ only at a single intervention step $k$: $$\mathrm{COD}(k) = \mathbb{E}\bigl[d\bigl(\hat{z}^{(1)}_H,\, \hat{z}^{(2)}_H\bigr)\bigr],$$ where $\hat{z}^{(j)}_H$ is the final rollout state under sequence $j$ and $d$ is a task-relevant distance (e.g., goal-state distance in physical tasks, edit distance in software tasks). When COD is low, a world model is largely unresponsive to changes in action, which makes it uninformative for counterfactual planning. Together, ASR and COD provide a more direct link between world-model quality and downstream agentic performance: ASR assesses whether the model supports good decisions, whereas COD assesses whether the model responds in a meaningful way to action-level interventions.
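Both metrics are simple to compute once rollouts are available. The following sketch is illustrative only: it assumes a deterministic one-step model `step(z, a)`, estimates the expectation in COD by averaging over a list of initial states, and uses 0-based indexing for the intervention step `k`. All function names are hypothetical.

```python
def action_success_rate(successes):
    """ASR: fraction of tasks that succeed under the model-derived policy."""
    return sum(1 for s in successes if s) / len(successes)


def rollout(step, z0, actions):
    """Roll a one-step model forward and return the final state z_H."""
    z = z0
    for a in actions:
        z = step(z, a)
    return z


def counterfactual_outcome_deviation(step, z0s, base_actions, k, alt_action, dist):
    """COD(k): mean distance between final states of two action sequences
    that differ only at (0-based) step k, averaged over initial states."""
    edited = list(base_actions)
    edited[k] = alt_action
    deviations = [
        dist(rollout(step, z0, base_actions), rollout(step, z0, edited))
        for z0 in z0s
    ]
    return sum(deviations) / len(deviations)
```

A toy comparison shows why low COD is a warning sign: with `step = lambda z, a: z + a`, changing one action from `1` to `-1` shifts the final state by 2, whereas an action-insensitive model such as `lambda z, a: z + 1` yields a COD of exactly 0 no matter which action is edited.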

Evaluating the three boundary conditions {#subsec:eval_boundary}
----------------------------------------

No single benchmark can evaluate an agent's mastery of all world rules. Because benchmark selection heavily shapes which boundary conditions are actually tested, we first sketch the landscape by governing-law regime.

#### Benchmark landscape by laws.

In physical-world domains, the **Atari 100k** benchmark [@kaiser2019simple] tests sample-efficient world-model learning under a strict 100k-step budget across 26 games, while **Meta-World** [@yu2020metaworld] provides 50 distinct robotic manipulation tasks for multi-task and meta-RL evaluation. **CALVIN** [@mees2022calvin] evaluates language-conditioned long-horizon manipulation with 24 hours of teleoperated play data and 20K language directives. **RoboCasa** [@nasiriany2024robocasa] and its successor RoboCasa365 [@nasiriany2026robocasa365] test long-horizon manipulation and physical stability, while **BuilderBench** [@ghugare2025builderbench] tests structural stability under physical load. **ManiSkill3** [@tao2024maniskill3] and **RLBench** [@james2020rlbench] provide large-scale demonstrations for generalizable manipulation. In autonomous driving, **nuScenes** [@caesar2020nuscenes] provides the standard multimodal benchmark with a full 360-degree sensor suite across 1000 scenes. The **Habitat** series [@savva2019habitat; @szot2021habitat20; @puig2023habitat3; @yokoyama2024hm3d] remains the standard for 3D navigation and rearrangement, **iGibson 2.0** and **BEHAVIOR-1K** [@li2021igibson2; @li2024behavior1k] extend to household activities that depend on object states and long-horizon semantics, and **VBench** [@huang2023vbench] evaluates video generation for physical compliance.

In digital-world domains, **OSWorld** [@xie2024osworld] and **macOSWorld** [@yang2025macosworld] test GUI grounding and receipt parsing on desktop operating systems; **SWE-bench** [@jimenez2023swebench] evaluates cross-file software engineering; **WebArena** [@zhou2023webarena] and **Mind2Web** [@deng2023mind2web] test web interactions; and **AppAgent** [@zhang2025appagent] and **AndroidWorld** [@rawles2024androidworld] extend digital constraints to mobile operating systems. **GameWorld** [@ouyang2026gameworld] introduces verifiable evaluation for multimodal agents.

In social-world domains, Theory of Mind benchmarks have systematically mapped LLM capabilities: **ToMi** [@le2019tomi] established properly balanced false-belief testing, **BigToM** [@gandhi2023bigtom] introduced causal templates for belief inference, and **OpenToM** [@xu2024opentom] expanded to psychological states where LLMs fall notably short. **Sotopia** [@zhou2024sotopia] provides multi-dimensional social simulation with negotiation and norm compliance; **AgentBench** [@liu2023agentbench] offers broad cross-domain assessment including role-playing; and game-based environments such as Werewolf and Avalon test deception, trust, and strategic social reasoning.

In scientific-world domains, **ScienceWorld** [@wang2022scienceworld] tests elementary scientific reasoning; **DiscoveryBench** [@majumder2024discoverybench] evaluates hypothesis generation and verification; **ChemCrow** [@bran2024augmenting] assesses chemical synthesis under strict validity constraints; and **FutureX** [@zeng2025futurex] tests evidence-based prediction from dynamic information streams.

Finally, open-world environments such as **Minecraft** (via MCU or Voyager) [@zheng2023mcu; @wang2023voyager], **Crafter** [@stanic2023learning], and **NetHack** [@kurenkov2023katakomba] evaluate the composition of skills across multiple governing laws simultaneously, e.g., combining physical combat with resource economics and long-horizon planning in procedurally generated worlds.

Table `\ref{tab:benchmark_summary}`{=latex} maps compact benchmark anchors to their primary governing-law regime, capability-level coverage, and key evaluation metrics; the prose above gives the broader landscape. Detailed evaluation protocols for each boundary condition (including counterfactual divergence testing for intervention sensitivity, degradation curves for long-horizon coherence, and regime-specific constraint verification) appear in Appendix `\ref{app:eval_extended}`{=latex}.

```{=latex}
\begin{table*}[!t]\caption{\textbf{Representative benchmark anchors by governing-law regime with capability-level coverage and core evaluation metrics.} The table is a compact comparison set; Section~\ref{sec:evaluation} discusses additional benchmarks in prose. \cmark\ = supported; \xmark\ = not supported.}
\label{tab:benchmark_summary}

\small
\setlength{\tabcolsep}{0.9mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l|cc|ccc|l}
\toprule
\textbf{Benchmark} & \multicolumn{2}{c|}{\textbf{Links}} & \textbf{L1} & \textbf{L2} & \textbf{L3} & \textbf{Core Metrics} \\
\midrule

\rowcolor{gray!15}
\textit{Physical World} \\
\shortstack[l]{Atari 100k~\citep{kaiser2019simple}}
& \paperlink{https://arxiv.org/abs/1903.00374}
& {---}
& \cmark & \cmark & \xmark
& Human-norm. score \\
\shortstack[l]{Meta-World~\citep{yu2020metaworld}}
& \paperlink{https://arxiv.org/abs/1910.10897}
& \githublink{https://github.com/Farama-Foundation/Metaworld}
& \cmark & \cmark & \xmark
& Success rate \\
\shortstack[l]{CALVIN~\citep{mees2022calvin}}
& \paperlink{https://arxiv.org/abs/2112.03227}
& \githublink{https://github.com/mees/calvin}
& \cmark & \cmark & \xmark
& Lang-cond. success \\
\shortstack[l]{RoboCasa~\citep{nasiriany2024robocasa}}
& \paperlink{https://arxiv.org/abs/2406.02523}
& \githublink{https://github.com/robocasa/robocasa}
& \cmark & \cmark & \xmark
& Task completion \\
\shortstack[l]{nuScenes~\citep{caesar2020nuscenes}}
& \paperlink{https://arxiv.org/abs/1903.11027}
& \githublink{https://github.com/nutonomy/nuscenes-devkit}
& \cmark & \cmark & \xmark
& mAP, NDS \\
\midrule

\rowcolor{gray!15}
\textit{Digital World} \\
\shortstack[l]{OSWorld~\citep{xie2024osworld}}
& \paperlink{https://arxiv.org/abs/2404.07972}
& \githublink{https://github.com/xlang-ai/OSWorld}
& \cmark & \cmark & \xmark
& Task success \\
\shortstack[l]{SWE-bench~\citep{jimenez2023swebench}}
& \paperlink{https://arxiv.org/abs/2310.06770}
& \githublink{https://github.com/princeton-nlp/SWE-bench}
& \cmark & \cmark & \cmark
& Resolve rate \\
\shortstack[l]{WebArena~\citep{zhou2023webarena}}
& \paperlink{https://arxiv.org/abs/2307.13854}
& \githublink{https://github.com/web-arena-x/webarena}
& \cmark & \cmark & \xmark
& Task success \\
\midrule

\rowcolor{gray!15}
\textit{Social World} \\
\shortstack[l]{Sotopia~\citep{zhou2024sotopia}}
& \paperlink{https://arxiv.org/abs/2310.11667}
& \githublink{https://github.com/sotopia-lab/sotopia}
& \cmark & \cmark & \xmark
& Social score \\
\shortstack[l]{FANToM~\citep{kim2023fantom}}
& \paperlink{https://arxiv.org/abs/2310.15421}
& \githublink{https://github.com/skywalker023/fantom}
& \cmark & \xmark & \xmark
& False-belief acc. \\
\shortstack[l]{Hi-ToM~\citep{wu2023hitom}}
& \paperlink{https://arxiv.org/abs/2310.16755}
& \githublink{https://github.com/ying-hui-he/Hi-ToM_dataset}
& \cmark & \xmark & \xmark
& Belief acc. \\
\midrule

\rowcolor{gray!15}
\textit{Scientific World} \\
\shortstack[l]{ScienceWorld~\citep{wang2022scienceworld}}
& \paperlink{https://arxiv.org/abs/2203.07540}
& \githublink{https://github.com/allenai/ScienceWorld}
& \cmark & \cmark & \xmark
& Task completion \\
\shortstack[l]{DiscoveryBench~\citep{majumder2024discoverybench}}
& \paperlink{https://arxiv.org/abs/2407.01725}
& \githublink{https://github.com/allenai/discoverybench}
& \cmark & \cmark & \cmark
& Hypothesis acc. \\
\bottomrule
\end{tabular}
\end{table*}
```
Differentiating L1, L2, and L3 via evaluation protocol {#subsec:eval_levels}
------------------------------------------------------

Importantly, the *same* benchmark can evaluate different capability levels depending on protocol. The level is fixed not by the benchmark but by what the protocol demands: L1 tests local prediction, L2 tests decision-usable simulation under the three boundary conditions, and L3 tests whether the system can revise itself from evidence (Section `\ref{subsec:l3_definition}`{=latex}). The examples below make this concrete, one per governing-law regime.

#### RoboCasa (physical world).

At L1, the benchmark reduces to predicting the next end-effector position given current state and action, measured by single-step position error. Elevating the protocol to L2 requires executing a full kitchen task (e.g., a pick-place-heat sequence) under mid-task perturbations such as object displacement or drawer obstruction; the relevant metrics shift to long-horizon success rate, catastrophic action fraction, and recovery rate after perturbation. An L3 protocol would further demand that the agent, after repeated failures on a novel kitchen layout, distills a persistent grasp strategy (e.g., "this handle requires a top-down approach") whose benefit carries over to subsequent trials. In practice, most robotic manipulation systems still report L1-style single-task success rates, and perturbation injection remains nonstandard.
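The three L2 metrics named above can be aggregated from per-episode logs. The sketch below assumes a hypothetical `Episode` log format and is not tied to the RoboCasa API; in particular, what counts as a "catastrophic" action and as a "recovery" must be defined by the evaluation protocol and is simply taken as given here.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-episode log entry."""
    success: bool           # task goal reached at the end of the episode
    n_actions: int          # total actions executed
    n_catastrophic: int     # actions flagged as catastrophic by the protocol
    perturbed: bool         # whether a mid-task perturbation was injected
    recovered: bool = False  # whether the agent recovered after the perturbation


def l2_metrics(episodes):
    """Aggregate long-horizon success, catastrophic action fraction, and recovery rate."""
    total_actions = sum(e.n_actions for e in episodes)
    perturbed = [e for e in episodes if e.perturbed]
    return {
        "success_rate": sum(e.success for e in episodes) / len(episodes),
        "catastrophic_action_fraction": (
            sum(e.n_catastrophic for e in episodes) / total_actions if total_actions else 0.0
        ),
        # Undefined (None) when no episode was perturbed.
        "recovery_rate": (
            sum(e.recovered for e in perturbed) / len(perturbed) if perturbed else None
        ),
    }
```

Reporting the recovery rate only over perturbed episodes keeps it from being diluted by unperturbed runs, which is one reason to log the perturbation flag explicitly.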

#### OSWorld and SWE-bench (digital world).

These benchmarks span the L1--L2 boundary. Single-step click prediction (OSWorld) or single-line code completion (SWE-bench) constitutes L1 evaluation. L2 demands cross-file issue resolution under injected failures (network timeouts, unexpected pop-ups, out-of-distribution states), tracked via long-horizon consistency and catastrophic action fraction [@xie2024osworld; @jimenez2023swebench]. The leap to L3 would require the system to generate durable artifacts: a reproduction script that becomes a regression test (SWE-bench), or a reusable installation procedure distilled from a failed attempt (OSWorld). Today, leaderboard systems primarily operate at L1--L2; L3-style asset generation is reported anecdotally across isolated case studies but not yet evaluated systematically or at scale.

#### Sotopia (social world).

Social simulation introduces a distinctive challenge: the \`\`perturbation" is another agent's strategy shift rather than a physics disturbance. An L1 evaluation measures next-turn prediction accuracy or perplexity. For L2, one agent's strategy is changed mid-conversation (counterfactual injection), and the protocol tracks goal completion under perturbation, commitment consistency, and whether social outcomes shift appropriately [@zhou2024sotopia]. L3 evaluation would require the agent to distill, after repeated negotiation failures, a new social strategy or norm-handling rule that persists and transfers to structurally similar scenarios. Existing social benchmarks rarely inject such counterfactual perturbations, so even L2-style evaluation is uncommon, and L3-style social strategy evolution remains largely unexplored in current negotiation and role-playing settings.

#### ScienceWorld and DiscoveryBench (scientific world).

Of the four regimes, autonomous science is closest to a genuine L3 evaluation paradigm. At L1, the task is predicting the outcome of a single experimental action (e.g., \`\`what happens when acid is added to base?"). L2 protocols require designing and executing a multi-step experimental sequence while maintaining causal coherence, measured by sequence validity, hypothesis-consistent action rate, and robustness to misleading observations [@wang2022scienceworld; @majumder2024discoverybench]. The critical L3 step is hypothesis revision: when experimental evidence falsifies a prediction, the system must update its belief structure and avoid previously falsified paths. Closed-loop systems such as CAMEO already demonstrate this kind of evidence-driven model revision in laboratory settings.
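
The hypothesis-revision step described above can be sketched as a small ledger that records falsified hypotheses and refuses to re-admit them. This is a minimal illustration, not the mechanism of any cited system; the class `HypothesisLedger`, its methods, and the example strings are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class HypothesisLedger:
    """Tracks which hypotheses are still live and which were falsified."""
    live: set = field(default_factory=set)
    falsified: set = field(default_factory=set)

    def propose(self, hypothesis: str) -> bool:
        """Admit a hypothesis unless evidence has already falsified it."""
        if hypothesis in self.falsified:
            return False  # avoid previously falsified paths
        self.live.add(hypothesis)
        return True

    def observe(self, hypothesis: str, predicted: str, observed: str) -> None:
        """Falsify a live hypothesis when its prediction disagrees with evidence."""
        if hypothesis in self.live and predicted != observed:
            self.live.discard(hypothesis)
            self.falsified.add(hypothesis)


# Hypothetical example: one prediction fails, so the hypothesis is retired.
ledger = HypothesisLedger()
ledger.propose("metal_A conducts")
ledger.observe("metal_A conducts", predicted="bulb_on", observed="bulb_off")
retry_allowed = ledger.propose("metal_A conducts")
```

An L3 protocol in this spirit would check two things from the trace: that the belief structure changed after the falsifying observation, and that later proposals never return to a falsified entry.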

#### Evaluation gaps and coverage.

The vast majority of current systems are evaluated only at L1 (single-step accuracy or end-to-end success rate without perturbation injection). L2-style evaluation protocols (counterfactual injection, degradation curves, constraint-violation detection) exist in principle and are demonstrated by individual benchmarks, but are not yet standard practice across the field. L3 evaluation infrastructure (regression suites, asset validation gates, cross-episode improvement tracking) is essentially nonexistent outside autonomous science. Closing this gap is a prerequisite for claims about world-modeling capability.
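
To make the L2 metrics named in the examples above concrete, the following sketch computes long-horizon success rate, catastrophic action fraction, and recovery rate from episode logs. The `Episode` record and its fields are assumptions made for illustration; real benchmarks define their own log schemas and their own criteria for what counts as a catastrophic action or a recovery.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    succeeded: bool        # task completed by the end of the episode
    n_actions: int         # total actions taken
    n_catastrophic: int    # actions flagged as irreversible or unsafe
    perturbed: bool        # whether a mid-task perturbation was injected
    recovered: bool        # whether progress resumed after the perturbation


def l2_metrics(episodes: List[Episode]) -> dict:
    """Aggregate the three L2-style metrics over a batch of episodes."""
    total_actions = sum(e.n_actions for e in episodes)
    perturbed = [e for e in episodes if e.perturbed]
    recovery: Optional[float] = None
    if perturbed:  # recovery rate is only defined over perturbed episodes
        recovery = sum(e.recovered for e in perturbed) / len(perturbed)
    return {
        "success_rate": sum(e.succeeded for e in episodes) / len(episodes),
        "catastrophic_action_fraction":
            sum(e.n_catastrophic for e in episodes) / total_actions,
        "recovery_rate": recovery,
    }


# Hypothetical logs, not results from any benchmark.
logs = [
    Episode(True, 20, 0, True, True),
    Episode(False, 30, 3, True, False),
    Episode(True, 10, 0, False, False),
    Episode(True, 40, 1, True, True),
]
metrics = l2_metrics(logs)
```

On these made-up logs the success rate is 0.75, the catastrophic action fraction is 0.04, and the recovery rate is 2/3. Reporting all three together is what separates an L2 protocol from an unperturbed end-to-end success rate.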

Benchmarks and coverage analysis {#subsec:eval_benchmarks}
--------------------------------

A growing line of work asks whether current systems genuinely learn world models or merely surface correlations. WorldSimBench [@qin2025worldsimbench] and WorldModelBench [@fan2025worldmodelbench] reveal that perceptual realism and action-conditioned fidelity can diverge sharply, while @vafa2024evaluating and @kang2025howfar demonstrate fundamental coherence and consistency failures that persist at scale. No single benchmark covers all capabilities; a capability coverage matrix mapping benchmarks to boundary conditions and regimes is provided in Appendix `\ref{app:eval_extended}`{=latex}.

We use four qualitative coverage labels in the capability matrix. **Strong (S)** means the benchmark directly and intentionally tests the capability through explicit task design and scoring. **Medium (M)** means the capability is exercised in a substantial but partial or indirect way. **Weak (W)** means the benchmark offers only incidental evidence about the capability. **--** means the capability is not meaningfully tested. These labels are judgment-based but are assigned according to task design, scoring visibility, and whether failure on the capability can be unambiguously detected from benchmark traces.

Broader multimodal evaluation resources such as HSSBench can complement this landscape by probing humanities- and social-science reasoning, although they do not directly evaluate decision-usable social-state rollouts or other world-model-specific boundary conditions [@kang2025hssbench].

We also propose the Minimal Reproducible Evaluation Package (MREP), a community standard for version locking, trace logging, failure taxonomy, tail statistics, and boundary condition mapping, detailed in Appendix `\ref{app:eval_extended}`{=latex}. Seen this way, MREP is not only an evaluation proposal for individual papers but also the minimal evidence infrastructure required to sustain any credible L3 gatekeeping loop in practice.
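
One way to work with the S/M/W/-- coverage labels is to store the matrix as a nested mapping and query it for capabilities that no benchmark covers adequately. The sketch below assumes placeholder benchmark and capability names (`bench_a`, `cap_x`, and so on); it does not reproduce the labels of the appendix matrix.

```python
from typing import Dict, List

# Ordering of the four qualitative labels, strongest first.
RANK = {"S": 3, "M": 2, "W": 1, "--": 0}

Matrix = Dict[str, Dict[str, str]]  # benchmark -> capability -> label


def best_label(matrix: Matrix, capability: str) -> str:
    """Strongest label any benchmark assigns to the capability."""
    labels = [row.get(capability, "--") for row in matrix.values()]
    return max(labels, key=RANK.__getitem__)


def coverage_gaps(matrix: Matrix, capabilities: List[str],
                  minimum: str = "M") -> List[str]:
    """Capabilities whose best available coverage falls below `minimum`."""
    return [c for c in capabilities
            if RANK[best_label(matrix, c)] < RANK[minimum]]


# Placeholder entries for illustration only.
matrix = {
    "bench_a": {"cap_x": "S", "cap_y": "W", "cap_z": "--"},
    "bench_b": {"cap_x": "M", "cap_y": "--", "cap_z": "--"},
}
gaps = coverage_gaps(matrix, ["cap_x", "cap_y", "cap_z"])
```

Here `gaps` lists `cap_y` and `cap_z`, the capabilities for which a suite built from these two placeholder benchmarks would offer at best incidental evidence.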

Open Challenges in Evaluation {#subsec:eval_open}
-----------------------------

Evaluation challenges remain even when benchmark suites and logging infrastructure improve. Some are methodological, including benchmark saturation and evaluation gaming; others are infrastructural, including the cost and variability of human evaluation and the lack of systematic meta-evaluation. *Benchmark saturation*: as top systems converge on near-ceiling performance, the discriminative power of existing benchmarks decreases. *Evaluation gaming*: systems can optimize for benchmark-specific artifacts rather than genuine capability. *Human evaluation*: for social-world and open-ended scenarios where automated metrics are unreliable, human judgment remains necessary, but it is costly and variable. *Meta-evaluation*: the validity of an evaluation protocol itself is rarely examined systematically in current world-model benchmarks.

Architectural and Computational Considerations {#sec:implementation}
==============================================

The value of a taxonomy lies not in categorization for its own sake but in guiding system design. This section decomposes world-model implementations along three architectural axes, namely representation, dynamics, and control interface (Section `\ref{subsec:impl_blocks}`{=latex}), and examines how the governing-law regime constrains which combinations are viable in practice (Section `\ref{subsec:impl_tradeoffs}`{=latex}). Deploying these systems raises cross-cutting engineering challenges: the choice between end-to-end and modular training, latency-compute tradeoffs, sim-to-real transfer, and graceful degradation under model uncertainty.

A learned world model amortizes simulation cost into a fixed computation graph at inference time, whereas explicit simulation typically scales more directly with the number of entities, interactions, solver steps, or horizon length. This does not mean neural inference is literally $O(1)$ in every relevant variable: its cost still depends on model size, input resolution, sequence length, and rollout depth. The practical advantage is instead that learned dynamics can offer near-constant-cost approximations with respect to aspects of system complexity that would otherwise require increasingly expensive explicit simulation.

Efficiency techniques matter here not as generic deployment tricks but because they interact differently with the three capability levels. For L1 systems, compression mainly trades off against one-step predictive accuracy. For L2 systems, memory and rollout efficiency directly affect achievable horizon, counterfactual branching, and thus long-horizon coherence. For L3 systems, the same efficiency choices affect whether regression-gated update loops are cheap enough to run continuously in deployment. Scaling further calls for few-step distillation for real-time planning, quantization and pruning (under the constraint that compounding errors amplify even minor per-step degradation), and KV cache compression for long-horizon autoregressive dynamics. A more extended treatment of these deployment and efficiency topics, together with concrete compute and latency measurements, appears in Appendix `\ref{app:impl_extended}`{=latex}.
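
The remark that compounding errors amplify minor per-step degradation can be illustrated with a standard worst-case recurrence. Assume, for the sake of the sketch, that the learned dynamics are $L$-Lipschitz in the state and that the one-step model error is at most $\epsilon$; then the rollout error satisfies $e_{t+1} \le L\,e_t + \epsilon$. The numerical values below (a per-step error rising from 0.010 to 0.012 after compression, $L = 1.1$, horizon 50) are hypothetical and not measurements of any system.

```python
def rollout_error_bound(eps: float, lipschitz: float, horizon: int) -> float:
    """Worst-case rollout error after `horizon` steps.

    Iterates e_{t+1} = lipschitz * e_t + eps from e_0 = 0, which equals
    eps * (lipschitz**horizon - 1) / (lipschitz - 1) when lipschitz != 1.
    """
    err = 0.0
    for _ in range(horizon):
        err = lipschitz * err + eps
    return err


# Hypothetical comparison: full-precision model vs. a compressed variant.
base = rollout_error_bound(eps=0.010, lipschitz=1.1, horizon=50)
compressed = rollout_error_bound(eps=0.012, lipschitz=1.1, horizon=50)
gap = compressed - base  # far larger than the 0.002 per-step difference
```

Under these assumptions the bound grows geometrically with the horizon, which is why a compression scheme that looks harmless under an L1 one-step metric can still shorten the usable L2 rollout horizon.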

Architectural building blocks: representation, dynamics, and control {#subsec:impl_blocks}
--------------------------------------------------------------------

Building a world-model system requires choosing components along three axes (Table `\ref{tab:arch_building_blocks}`{=latex}). Each choice carries distinct tradeoffs that determine which capability level (L1/L2/L3) the resulting system can reach and in which governing-law regime it will be most effective.

#### Representation.

At one extreme, symbolic or programmatic states (e.g., VirtualHome [@puig2018virtualhome]) offer interpretability and enable hard constraint enforcement, but demand heavy manual engineering and cover only pre-specified state spaces; they are best evaluated by success rate and error-branch coverage. At the other extreme, latent continuous representations, such as the RSSM in DreamerV3 [@hafner2023dreamerv3] and V-JEPA2 [@meta2025vjepa2], handle high-dimensional multimodal inputs with relatively little hand-designed structure. Their weakness is that, over long horizons, they are more susceptible to semantic drift and state aliasing, making long-horizon consistency and failure attribution especially important for evaluation. VL-JEPA [@chen2025vl] develops a joint-embedding predictive architecture that predicts continuous embeddings of the target text. VLog [@lin2025vlog] uses a learnable token to retrieve narrations, which then serve as a video-centric vocabulary for long-video understanding. Between these two extremes lie structured 3D representations, including occupancy models such as RoboOccWorld [@zhang2025robooccworld] and point-flow models such as PointWorld [@huang2026pointworld]. They are appealing because they fit physical constraints more naturally, but this advantage often comes with reconstruction and computational bottlenecks. As a result, reachability and stability become particularly important in evaluation. Finally, discrete token representations (e.g., VQ-VAE codebooks in IRIS [@micheli2023iris]) enforce compositionality and enable exact likelihood training via cross-entropy, bridging continuous perception with autoregressive dynamics.
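
The bridge from continuous latents to discrete tokens can be sketched as a nearest-neighbor codebook lookup, the quantization step at the core of VQ-style tokenizers. This is a toy illustration in plain Python with a hand-written two-dimensional codebook; it is not the IRIS tokenizer, which learns its codebook jointly with an encoder and decoder.

```python
from typing import List, Sequence


def quantize(vector: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(entry: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def tokenize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map a sequence of continuous latents to discrete token ids."""
    return [quantize(z, codebook) for z in latents]


# Hypothetical codebook and latents, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]
tokens = tokenize(latents, codebook)  # [0, 1, 2]
```

Once states are token ids, a sequence model can be trained on them with an ordinary cross-entropy objective. The residual between each latent and its codebook entry is the lossy-quantization cost listed among the failure modes in Table `\ref{tab:arch_building_blocks}`{=latex}.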

#### Dynamics.

Stochastic latent dynamics, exemplified by DreamerV3 [@hafner2023dreamerv3], express uncertainty and multimodality through principled ELBO training and uncertainty-aware rollouts, but may degrade or become miscalibrated over long horizons. Where uncertainty modeling is less critical, deterministic value-aware dynamics (MuZero [@schrittwieser2020muzero], TD-MPC2 [@hansen2024tdmpc2]) optimize the transition function directly for downstream value prediction, trading generative flexibility for tighter integration with the control objective. Autoregressive token dynamics (iVideoGPT [@wu2024ivideogpt], LWM [@liu2024lwm]) offer a unified scalable interface that handles multiple modalities through a shared vocabulary, though long-horizon logical consistency remains a weak point. Diffusion-based dynamics (the Sora technical line [@brooks2024sora], DIAMOND [@alonso2024diamond], and interactive environments such as Genie [@deepmind2025genie3]) deliver photorealistic observation-level transitions, but they require multi-step denoising at inference time, which adds latency, and they often offer only weak action controllability.

#### Control interface.

Online MPC-style approaches (TD-MPC2 [@hansen2024tdmpc2], PETS [@chua2018pets]) replan at every step using short-horizon rollouts, providing fast correction at the cost of compute and latency pressure. Tree search and expansion (MuZero [@schrittwieser2020muzero], EfficientZero [@ye2021efficientzero]) enable counterfactual branching and systematic look-ahead, though they amplify model errors and can exploit benchmark loopholes. Rather than planning in the environment at all, imagined-rollout policy optimization (the Dreamer family [@hafner2019dreamer; @hafner2020dreamerv2; @hafner2023dreamerv3]) trains a policy entirely on model-generated trajectories, avoiding real interaction during learning but requiring highly accurate dynamics. At the deployment end, offline policy distillation (GR-1 [@wu2023gr1]) enables cheap inference yet is fragile under distribution shift, motivating OOD stress tests. A distinct strategy altogether, replayable-environment interfaces (OSWorld [@xie2024osworld], SWE-agent [@yang2024sweagent]) sidestep learned dynamics entirely, treating the real environment as its own simulator and relying on receipt parsing and state fingerprinting. More broadly, part of the control problem is deciding when external computation should be invoked at all, rather than treating tool use as either mandatory or absent; adaptive tool-integration work provides a useful planner-side example of this distinction [@wang2025tocode].
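
The online MPC pattern described above (replan at every step, execute only the first action) can be sketched with random-shooting planning over a toy model. Everything here is an illustrative assumption: `model` is a one-dimensional integrator standing in for a learned dynamics model, and the planner is plain random shooting rather than the optimizers used in TD-MPC2 or PETS.

```python
import random


def model(state: float, action: float) -> float:
    """Hypothetical stand-in for a learned dynamics model: a 1-D integrator."""
    return state + action


def plan(state: float, goal: float, rng: random.Random,
         horizon: int = 3, n_candidates: int = 128) -> float:
    """Random-shooting MPC: return the first action of the cheapest rollout."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, cost = state, 0.0
        for a in actions:  # short-horizon imagined rollout
            s = model(s, a)
            cost += (s - goal) ** 2
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first


rng = random.Random(0)
start, goal = 0.0, 3.0
state = start
for _ in range(20):  # closed loop: replan from the newly observed state
    action = plan(state, goal, rng)
    state = model(state, action)  # here the environment equals the model
```

The compute pressure noted in the text is visible in the loop structure: each control step costs `horizon * n_candidates` model evaluations, so the planner's budget scales directly with per-step model latency.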

```{=latex}
\begin{table*}[ht]\caption{\textbf{Architectural building blocks for world models.} Three design axes (representation, dynamics, control interface) are cross-referenced with concrete options, representative systems, strengths, and dominant failure modes.}
\label{tab:arch_building_blocks}

\small
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{p{2.8cm}|p{13.2cm}}
\hline
\textbf{Design Axis} & \textbf{Options, Systems, Strengths, and Failure Modes} \\
\hline
\rowcolor{gray!15} \textbf{Representation} \\
\hline
\textbf{Representation} &
$\square$ \textbf{Symbolic / Programmatic:} VirtualHome. Interpretable; hard constraint enforcement. Failure: heavy manual engineering; limited state space.\\[0.3em]
& $\square$ \textbf{Latent Continuous:} DreamerV3 (RSSM); V-JEPA~2. Scalable; absorbs high-dim multimodal input. Failure: semantic drift; state aliasing over long horizons.\\[0.3em]
& $\square$ \textbf{Structured 3D:} RoboOccWorld; PointWorld. Natural physical-constraint alignment. Failure: reconstruction bottleneck; high compute cost.\\[0.3em]
& $\square$ \textbf{Discrete Tokens:} IRIS (VQ-VAE codebook). Compositional; exact cross-entropy training. Failure: codebook collapse; lossy quantization. \\
\hline
\rowcolor{gray!15} \textbf{Dynamics} \\
\hline
\textbf{Dynamics} &
$\square$ \textbf{Stochastic Latent:} DreamerV3. Principled uncertainty via ELBO; multimodal. Failure: miscalibration over long horizons.\\[0.3em]
& $\square$ \textbf{Deterministic Value-Aware:} MuZero; TD-MPC2. Tight value integration; planning-optimized. Failure: no explicit uncertainty; brittle under novelty.\\[0.3em]
& $\square$ \textbf{Autoregressive Token:} iVideoGPT; LWM. Unified multi-modal interface; scalable. Failure: weak long-horizon logical consistency.\\[0.3em]
& $\square$ \textbf{Diffusion-Based:} Sora; DIAMOND; Genie~2. Photorealistic observation-level transitions. Failure: multi-step denoising latency; weak action control. \\
\hline
\rowcolor{gray!15} \textbf{Control Interface} \\
\hline
\textbf{Control Interface} &
$\square$ \textbf{Online MPC:} TD-MPC2; PETS. Fast closed-loop correction; reactive. Failure: high per-step compute; latency pressure.\\[0.3em]
& $\square$ \textbf{Tree Search:} MuZero; EfficientZero. Counterfactual branching; systematic look-ahead. Failure: amplifies model errors; benchmark exploitation.\\[0.3em]
& $\square$ \textbf{Imagined-Rollout Policy:} Dreamer family. No real interaction during training. Failure: requires highly accurate dynamics.\\[0.3em]
& $\square$ \textbf{Offline Distillation:} GR-1. Cheap and fast deployment. Failure: distribution shift.\\[0.3em]
& $\square$ \textbf{Replayable Environment:} OSWorld; SWE-agent. Real env as simulator; attributable failures. Failure: grounding breaks under UI/API changes. \\
\hline
\end{tabular}
\end{table*}
```
Design tradeoffs across governing-law regimes {#subsec:impl_tradeoffs}
---------------------------------------------

The building blocks above are not interchangeable; the governing-law regime determines which combinations are viable and which failure modes dominate. Table `\ref{tab:compute_latency}`{=latex} summarizes how deployment-regime latency budgets constrain the viable dynamics model classes and their control interfaces.
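
A latency budget translates into a hard cap on planning depth. The sketch below assumes a planner that evaluates candidate rollouts in sequential batches with a fixed per-step model latency; the budget of 100 ms is taken as the upper end of the real-time robotics row in Table `\ref{tab:compute_latency}`{=latex}, while the per-step latency, candidate count, and batch size are hypothetical.

```python
import math


def planning_latency_ms(step_ms: float, horizon: int,
                        n_candidates: int, batch_size: int) -> float:
    """Latency of rolling out all candidates, one batch after another."""
    n_batches = math.ceil(n_candidates / batch_size)
    return step_ms * horizon * n_batches


def max_horizon(budget_ms: float, step_ms: float,
                n_candidates: int, batch_size: int) -> int:
    """Longest rollout horizon that still fits within the latency budget."""
    n_batches = math.ceil(n_candidates / batch_size)
    return int(budget_ms // (step_ms * n_batches))


# Hypothetical planner: 2 ms per batched model step, 64 candidates, batch 32.
h = max_horizon(budget_ms=100, step_ms=2.0, n_candidates=64, batch_size=32)
used = planning_latency_ms(2.0, h, 64, 32)
```

With these assumed numbers the planner can afford at most 25 rollout steps per control cycle, which is why low-latency regimes favor lightweight latent dynamics over multi-step denoising models.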

#### Physical-world systems.

Everything hinges on contact, reachability, and stability under continuous actions. Representations must preserve geometry and contact relations; dynamics must be stable over short-to-medium horizons; and the control interface must be fast enough for closed-loop correction. Latent or structured 3D representations paired with MPC or imagined-rollout policies dominate this regime [@hafner2023dreamerv3; @hansen2024tdmpc2; @huang2026pointworld]. Short-horizon rollouts reduce compounding error [@janner2019mbpo], and MPC provides an online correction mechanism. The main pitfalls are the assumption that an accurate 3D scene is already available, degraded 3D reconstruction quality, semantic drift in latent space, constraint violations that remain plausible in the learned representation, and the sim-to-real gap for contact-rich interactions. It is useful to distinguish at least three transfer curves in practice: transfer across input modalities, transfer across sensor suites, and transfer across environments, since each exposes a distinct failure mode of the learned dynamics and demands its own diagnostic instrumentation.

#### Digital-world systems.

State-machine and branch consistency, rather than learned dynamics, are the primary bottleneck. Symbolic or DOM-based states paired with replayable environments are the dominant design in this setting [@xie2024osworld; @jimenez2023swebench; @yang2024sweagent]. Because they expose explicit state machines and support strong evidence logging, they make failures easier to trace and thus support Evolver-style asset distillation. This transparency, however, is not without cost: grounding may break under UI changes, loading variability and race conditions introduce non-deterministic noise, and benchmark artifacts remain vulnerable to reward gaming and to subtle shifts in the underlying software stack.
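
Because replayable environments make outcomes checkable, an Evolver-style update can be gated on a regression suite: a candidate is accepted only if it breaks nothing that currently passes. The sketch below is one simple reading of such a gate; the `Case` format, the dictionary-backed policies, and the task names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Case = Tuple[str, str]            # (task input, expected outcome)
Policy = Callable[[str], str]     # maps a task input to an outcome


def passed_cases(policy: Policy, suite: List[Case]) -> Set[int]:
    """Indices of suite cases on which the policy yields the expected outcome."""
    return {i for i, (x, expected) in enumerate(suite) if policy(x) == expected}


def regression_gate(current: Policy, candidate: Policy,
                    suite: List[Case]) -> bool:
    """Accept the candidate only if every currently passing case still passes."""
    return passed_cases(current, suite) <= passed_cases(candidate, suite)


# Hypothetical suite and policies for illustration.
suite = [("open_settings", "ok"), ("install_pkg", "ok"), ("close_popup", "ok")]
current = {"open_settings": "ok", "install_pkg": "fail", "close_popup": "ok"}.get
improved = {"open_settings": "ok", "install_pkg": "ok", "close_popup": "ok"}.get
regressed = {"open_settings": "fail", "install_pkg": "ok", "close_popup": "ok"}.get

accept_improved = regression_gate(current, improved, suite)    # True
accept_regressed = regression_gate(current, regressed, suite)  # False
```

The second candidate fixes the failing case but breaks a passing one, so the gate rejects it. A deployed gate would also need the rollback path and versioned traces listed in the digital L3 row of Table `\ref{tab:impl_roadmap}`{=latex}.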

#### Social-world systems.

The dominant bottleneck is maintaining coherent agent identity and relational state across extended interactions. Persona state must persist over hundreds of turns without drift, yet Theory-of-Mind (ToM) inference, which updates beliefs about other agents' goals, knowledge, and intentions, imposes per-step costs that grow with the number of modeled agents. Multi-agent communication compounds the problem: $n$-agent interactions generate $O(n^2)$ pairwise belief updates per step, making naïve scaling infeasible for the 10,000$+$-agent simulations now appearing in the literature [@piao2025agentsociety]. Norm-consistency checking adds a further constraint: valid social rollouts must respect evolving norms (politeness conventions, negotiation protocols, institutional rules), and violations must be detectable at rollout time rather than post hoc [@zhou2024sotopia]. The overarching challenge is that agent identity is not a fixed state vector but an emergent property of interaction history; maintaining stable identity under multi-turn dynamics while still allowing genuine belief revision remains an open architectural problem that current LLM-based agents address only superficially through system-prompt conditioning.
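
The quadratic cost is easy to quantify. The sketch below counts ordered pairwise belief updates per step and contrasts them with a sparse alternative in which each agent models only `k` others; the sparse scheme and the value `k = 20` are illustrative assumptions, not a method proposed in the cited work.

```python
def pairwise_updates(n_agents: int) -> int:
    """Ordered (observer, target) belief updates per step when all pairs interact."""
    return n_agents * (n_agents - 1)


def sparse_updates(n_agents: int, k: int) -> int:
    """Updates per step if each agent tracks beliefs about at most k others."""
    return n_agents * min(k, n_agents - 1)


dense = pairwise_updates(10_000)      # 99,990,000 updates per step
sparse = sparse_updates(10_000, 20)   # 200,000 updates per step
```

At 10,000 agents the dense scheme needs roughly 500 times more belief updates per step than the assumed 20-neighbor scheme, which indicates why naïve all-pairs ToM inference does not scale to population-level simulations.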

#### Generative simulation systems.

The central tension is between visual fidelity and action controllability. High-fidelity diffusion or autoregressive models [@brooks2024sora; @bruce2024genie; @alonso2024diamond] excel at producing photorealistic outputs useful for demonstration and synthetic data generation, but action-conditioning is often unstable and long-horizon consistency is difficult. A system can be mistakenly treated as planning-ready when it is not decision-usable; evaluation should prioritize action-response consistency and long-horizon stability over raw perceptual realism [@wu2024ivideogpt; @liu2024lwm].

#### Scientific-world systems.

Evidence-chain validity and falsifiability matter more than perceptual quality in this regime (cf. the Popperian reading in Section `\ref{sec:philosophy_l3}`{=latex}). Representations must be interpretable and traceable to experimental evidence; dynamics must respect known mechanism boundaries; and the control interface should support experiment selection and belief updates rather than action execution [@wang2022scienceworld]. The distinctive risks are hallucinated mechanisms that appear plausible but lack grounding, correlation mistaken for causation, and negative results that are silently discarded rather than propagated through the model.

#### VLA vs. native world models.

A crosscutting architectural question is whether to embed world-model capabilities inside a Vision-Language-Action (VLA) model or to build a dedicated world-model module. VLAs inherit the scaling infrastructure and pretraining data of large language models, but their world-modeling capacity is implicit and difficult to isolate or evaluate. Recent efforts to make this capacity more explicit include spatially guided training that injects geometric structure into VLA policy learning [@chen2025internvla], aiming to bridge the gap between implicit visual knowledge and the explicit physical state awareness that world models require. Related work makes this implicit capacity more procedural than geometric: Pixel Reasoner equips VLMs with explicit visual operations such as zoom-in and select-frame for curiosity-driven evidence gathering, while Visual Rationale Learning treats such visual actions as core reasoning primitives rather than optional tools, together highlighting a broader shift toward explicit perceptual control inside VLM-like agents even when no standalone transition model is exposed [@wang2025pixelreasoner; @wang2025virl]. Native world models expose an explicit transition function that can be queried, composed, and stress-tested independently. Competition between these paradigms is partly sociotechnical rather than purely algorithmic: the massive investment in LLM infrastructure creates path dependencies that favor VLA-style integration even when a dedicated module might be technically superior [@hooker2021hardware], and whether the field converges on native world models or VLA-style surrogates may partly depend on tool ecosystems, available datasets, and hardware compatibility, in addition to intrinsic modeling power. From an evaluation standpoint, the litmus test is whether the system's predictions can be decoupled from its language generation and tested against the three boundaries (Section `\ref{subsec:l2_requirements}`{=latex}).

These regimes are not mutually exclusive. In practice, mature systems often stack multiple design patterns: symbolic or workflow planning at the top for high-level task decomposition, replayable environments in the middle for receipt validation and failure attribution, and short-horizon continuous control at the bottom for real-time correction [@moerland2023mbrl]. This suggests that **the relevant unit of analysis is the composed system, not any single module in isolation**. Representation, dynamics, and control should therefore be evaluated together, in light of the constraints they impose and the evidence they make available. Many apparent disagreements in the literature then look less like fundamental disputes about whether world models work, and more like differences in where systems land along these design axes.

```{=latex}
\begin{table*}[ht]\caption{\textbf{Deployment latency budgets and engineering bottlenecks by regime.} Inference latency budgets range from sub-100\,ms for real-time robotics to minutes for offline scientific planning; the table maps each regime to viable dynamics model classes and primary engineering bottlenecks. These are deployment budget ranges rather than measured benchmark results; empirical throughput depends on model size, hardware, batching, simulator implementation, and verification overhead.}
\label{tab:compute_latency}

\small
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{p{3.5cm}|p{12cm}}
\hline
\textbf{Regime} & \textbf{Latency, Dynamics Class, and Bottleneck} \\
\hline
\rowcolor{gray!15} \textbf{Physical World} \\
\hline
\textbf{Real-time robotics} &
$\square$ \textbf{Latency:} $<$100\,ms.\\[0.3em]
& $\square$ \textbf{Dynamics:} Latent dynamics + MPC; lightweight RSSM; neural ODE.\\[0.3em]
& $\square$ \textbf{Bottleneck:} Per-step inference latency; compounding error within control loop. \\
\hline
\textbf{Autonomous driving} &
$\square$ \textbf{Latency:} $<$200\,ms.\\[0.3em]
& $\square$ \textbf{Dynamics:} Occupancy flow; latent diffusion; BEV prediction.\\[0.3em]
& $\square$ \textbf{Bottleneck:} Sensor-fusion latency; safety-critical constraint verification. \\
\hline
\textbf{Embodied navigation} &
$\square$ \textbf{Latency:} 100--500\,ms.\\[0.3em]
& $\square$ \textbf{Dynamics:} RSSM; object-centric GNN; point-cloud dynamics.\\[0.3em]
& $\square$ \textbf{Bottleneck:} 3D reconstruction cost; memory for large-scale maps. \\
\hline
\rowcolor{gray!15} \textbf{Digital World} \\
\hline
\textbf{Web / GUI agents} &
$\square$ \textbf{Latency:} $\sim$1--5\,s.\\[0.3em]
& $\square$ \textbf{Dynamics:} LLM-as-world-model; DOM-based prediction; state-machine rollout.\\[0.3em]
& $\square$ \textbf{Bottleneck:} LLM inference cost; UI non-determinism and race conditions. \\
\hline
\textbf{Software engineering} &
$\square$ \textbf{Latency:} $\sim$5--30\,s.\\[0.3em]
& $\square$ \textbf{Dynamics:} LLM + MCTS rollout; code-graph traversal.\\[0.3em]
& $\square$ \textbf{Bottleneck:} Context window limits; cross-file dependency resolution. \\
\hline
\textbf{Game AI (real-time)} &
$\square$ \textbf{Latency:} $<$50\,ms.\\[0.3em]
& $\square$ \textbf{Dynamics:} Tree search (MCTS); value-aware latent dynamics; EfficientZero.\\[0.3em]
& $\square$ \textbf{Bottleneck:} Branching factor; search-depth vs.\ latency tradeoff. \\
\hline
\rowcolor{gray!15} \textbf{Social World} \\
\hline
\textbf{Social / multi-agent} &
$\square$ \textbf{Latency:} $\sim$1--10\,s.\\[0.3em]
& $\square$ \textbf{Dynamics:} ToM network; multi-agent rollout; commitment-graph update.\\[0.3em]
& $\square$ \textbf{Bottleneck:} $O(n^2)$ pairwise belief updates; persona-state drift. \\
\hline
\rowcolor{gray!15} \textbf{Scientific World} \\
\hline
\textbf{Scientific planning} &
$\square$ \textbf{Latency:} Minutes to hours.\\[0.3em]
& $\square$ \textbf{Dynamics:} Full diffusion ensemble; Bayesian surrogate; PINN; active learning.\\[0.3em]
& $\square$ \textbf{Bottleneck:} Experiment budget; surrogate calibration; data scarcity. \\
\hline
\end{tabular}
\end{table*}
```
Implementation Roadmap {#subsec:impl_roadmap}
----------------------

Table `\ref{tab:impl_roadmap}`{=latex} distills the architectural guidance from the preceding sections into a concise roadmap organized by capability level and governing-law regime. For each cell, we list the representation format that best preserves the regime's planner-critical structure, the dynamics model class that is most tractable at that capability level, and the single most important engineering bottleneck that must be addressed to reach the next level.

```{=latex}
\begin{table*}[ht]\caption{\textbf{Design roadmap across governing-law regimes.} For each regime, we summarize the representation, dynamics, and bottleneck at L1--L3.}
\label{tab:impl_roadmap}

\small
\renewcommand{\arraystretch}{1.08}
\begin{tabular}{llll}
\hline
 & \textbf{Representation} & \textbf{Dynamics} & \textbf{Bottleneck} \\
\hline

\textbf{Physical} \\
\hline
\textbf{L1} & Latent state, point-cloud input & RSSM, latent transitions & Long-horizon prediction error \\
\textbf{L2} & 3D, object-centric state & Latent MBRL, neural ODE rollout & Contact instability, constraints \\
\textbf{L3} & Physics prior, residual model & Hybrid sim-to-real adaptation & Failure attribution across modules \\
\hline

\textbf{Digital} \\
\hline
\textbf{L1} & DOM tree, UI state & LLM-based state prediction & Grounding on unseen layouts \\
\textbf{L2} & State-machine abstraction & LLM rollout, MCTS planning & Exploits, race conditions \\
\textbf{L3} & Versioned tests, execution traces & Regression-gated updates & Safe deployment, rollback \\
\hline

\textbf{Social} \\
\hline
\textbf{L1} & Belief state, dialogue history & ToM, recurrent updates & Hidden mental states \\
\textbf{L2} & Commitment graph, norm state & Multi-agent rollout & Role drift, forgetting \\
\textbf{L3} & Social model, update gates & Bayesian revision & Attribution ambiguity, ethics \\
\hline

\textbf{Scientific} \\
\hline
\textbf{L1} & Molecular graph, field state & GNN surrogate, FNO dynamics & OOD generalization \\
\textbf{L2} & Hypothesis-evidence chain & Bayesian surrogate, PINN rollout & Hallucinated mechanisms, calibration \\
\textbf{L3} & Protocol, surrogate model & Active Bayesian learning & Data budget, instruments \\
\hline
\end{tabular}
\end{table*}
```
Three cross-cutting engineering principles hold across all cells. First, **separate what is learned from what is enforced**: hard constraint layers (collision checkers, state-machine validators, regression gates) should be applied at inference time rather than learned implicitly, because soft enforcement through training loss cannot guarantee zero-violation rollouts. Second, **instrument before you iterate**: logging, replay, and failure attribution infrastructure should be built into the system from the start; without replay, L3 revision becomes anecdotal and ungovernable. Third, **match the representation to the planner's query**: a representation that looks realistic but does not expose the variables the planner needs (free space, permission state, reaction rate) is worse than a lower-fidelity representation that does.
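
The first principle can be sketched as a thin enforcement layer that sits between a learned proposer and execution: proposals arrive ranked by the learned model, and a hand-written validator has the final say. The function `enforce`, the one-dimensional workspace, and the proposal values below are hypothetical illustrations.

```python
from typing import Callable, Iterable, Optional, TypeVar

A = TypeVar("A")


def enforce(proposals: Iterable[A],
            is_valid: Callable[[A], bool]) -> Optional[A]:
    """Return the highest-ranked proposal that passes the hard constraint.

    Returns None when every proposal violates the constraint, so the caller
    must handle the no-safe-action case explicitly instead of executing.
    """
    for action in proposals:
        if is_valid(action):
            return action
    return None


def in_free_space(x: float) -> bool:
    """Hypothetical collision check: the workspace is the interval [0, 1]."""
    return 0.0 <= x <= 1.0


# Targets ranked by a (hypothetical) learned model, best first.
ranked_targets = [1.3, 0.9, 0.4]
chosen = enforce(ranked_targets, in_free_space)   # 0.9
blocked = enforce([1.3, -0.2], in_free_space)     # None
```

The learned model remains free to rank actions however it likes, but a violating action never reaches execution, a guarantee that a training-time penalty alone cannot provide.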

Trends & Open Problems {#sec:trends}
======================

The preceding sections have established L1--L3 as a capability ladder for world models. We now situate this ladder historically (Figure `\ref{fig:historical_timeline}`{=latex}), survey the research frontier that is pushing each rung forward, and catalog the open problems whose resolution will determine whether world models mature from impressive demonstrations into reliable scientific and engineering tools.

Historical Development {#sec:trends:history}
----------------------

#### Mathematical Principles (--1956).

The impulse to build predictive models of reality long predates artificial intelligence. Newton's *Principia* [@newton1687principia] provided the first unified mathematical world model: given initial positions and velocities, his laws of motion and gravitation could, in principle, predict arbitrary future states of a mechanical system. @laplace1814essai distilled this ambition into the thought experiment now known as \`\`Laplace's demon," an intelligence that, given complete knowledge of the present, could compute the entire future of the universe. @turing1950computing then posed the question of whether machines could think, establishing the conceptual bridge from mathematical modeling to artificial intelligence. These developments established the core tension that persists today: the tradeoff between model fidelity and tractable horizon. The philosophical foundations of this progression, from Hume's empiricism through Kant's structural priors to Lakatos's governed revision, are discussed in Section `\ref{sec:philosophy}`{=latex}.

#### Symbolic Intelligence (1956--1986).

Early AI attempted to hand-code world models as logical rules and constraints. STRIPS [@fikes1971strips] introduced the first action-schema representation for robotic planning, but the Frame Problem [@mccarthy1969some] (see Section `\ref{subsec:l2_requirements}`{=latex}) revealed that every action requires explicit axioms specifying what does *not* change, a burden that grows combinatorially. The Lighthill Report [@lighthill1973artificial] catalyzed the first AI winter (1974--1980) by exposing the gap between laboratory demonstrations and real-world competence. The second winter (1987--1993) followed the brittleness of expert systems and the collapse of the Lisp-machine market: hand-crafted knowledge bases such as CYC could not gracefully handle uncertainty and commonsense exceptions [@lenat1995cyc]. The overarching lesson was clear: purely symbolic world models do not scale to open-world domains.

#### Connectionist Resurgence (1986--2020).

The revival of neural networks, from backpropagation [@rumelhart1986learning] through deep convolutional networks [@lecun1998gradient; @krizhevsky2012imagenet] and Transformers [@vaswani2017attention], shifted the paradigm from hand-coded rules to learned representations. World models re-emerged in model-based reinforcement learning, from latent dynamics models to general pixel-based control (see Section `\ref{sec:l1}`{=latex} for details).

#### Generative Revolution (2020--present).

Diffusion models [@ho2020ddpm] and large-scale language models such as GPT-3 [@brown2020gpt3] have catalyzed a qualitative shift, building on the Transformer backbone established in the preceding era. Video generation models [@brooks2024sora; @nvidia2025cosmos; @bruce2024genie] and LLM-based agents [@hao2023reasoning; @wang2023voyager] are blurring the boundary between prediction and simulation, though systematic physics violations persist [@gu2025phyworldbench] (Sections `\ref{sec:l1}`{=latex}--`\ref{sec:l2}`{=latex}). More broadly, the field is converging toward a *neuro-symbolic* frontier [@marra2024neurosymbolic_survey; @zhao2026neurosymbolic_synergy] that combines neural dynamics modules for learning transition functions (L1/L2) with symbolic components for constraint enforcement and hypothesis-space expansion (L3).

Across all four eras, representation learning serves as shared infrastructure: the quality of the learned state $z_t$ determines the ceiling for prediction (L1), simulation (L2), and revision (L3) alike. Whether the representation is a latent vector, a discrete token sequence, a 3D point cloud, or a program, the governing-law regime determines which invariants the representation must preserve.

This historical arc suggests a consistent lesson: progress in world modeling has come not from scale alone, but from changing what is represented, what is compositional over horizon, and what can be revised from evidence. The open problems below are organized around remaining bottlenecks at L1, L2, and L3.

Open Problems by Capability Level {#sec:trends:open}
---------------------------------

The preceding sections reveal a clear trajectory: world models are progressing from isolated one-step predictors toward integrated, agent-facing simulators that must respect domain-specific governing laws over extended horizons. Across all four regimes, this progression exposes a common pattern. In embodied domains, visual plausibility is outpacing physical faithfulness: models generate convincing video but violate conservation laws and object permanence under rollout, with the best system attaining a success rate of only 0.262 on physical-consistency tests [@gu2025phyworldbench; @li2025embodied_wm_survey]. In social domains, large-scale agent simulations reproduce emergent phenomena such as opinion polarization and governance formation [@piao2025agentsociety; @dai2024artificialleviathan], but LLM agents exhibit systematic biases toward consensus that diverge from human behavioral patterns [@taubenfeld2024biases; @chuang2024opinion]. In code domains, agents treat software as deterministic state machines while real systems are partially observable, asynchronous, and multi-tenant [@xu2025warex]. In scientific domains, neural surrogates trained on simulation data degrade when applied to real experimental measurements, exposing a surrogate-to-reality gap analogous to sim-to-real in robotics [@minami2025simtorealsciml]. The overarching theme is that the bottleneck has shifted from generating plausible futures to ensuring those futures are *decision-usable*: faithful to governing constraints, responsive to interventions, and calibrated against real-world evidence.

We organize ten concrete open problems by the capability level at which they most directly arise.

#### Representation and Local Prediction.

1.  **Physical faithfulness beyond visual plausibility.** Current video and 3D world models achieve perceptual realism but fail physical-consistency tests: PhyWorldBench [@gu2025phyworldbench] reports that the best of twelve frontier models attains only a 0.262 success rate on conservation-law and object-permanence probes, with long-horizon error accumulation as the core structural weakness. Closing this gap requires physically grounded representations that enforce constraints under counterfactual rollout, not merely pixel fidelity. The representation choices discussed in Section `\ref{sec:trends:representation}`{=latex} may offer one direction; spatially guided training strategies that inject geometric supervision into vision-language-action models [@ye2026st4vla] offer another.

2.  **Metric-aware video world modeling.** Extending geometry-grounded editing from image pairs to temporally coherent video demands four coupled abilities: metric estimation across time, temporal composition of short-step predictions, identity and appearance preservation across frames, and instruction grounding that aligns predicted motion with semantic specifications [@ho2022videodiffusion; @xing2023dynamicrafter]. Moving to video provides denser temporal supervision and stronger identity constraints than image-pair approaches. Subject-faithful controllable editing methods such as RealCustom++ [@mao2026realcustompp] may supply useful interface components. Evaluation must measure metric controllability directly, not just perceptual quality [@huang2023vbench; @ge2024fvdbias].

3.  **Programmable visual representation.** Current visual world models represent state as raw pixels or latent embeddings, neither of which is compositional or precisely editable. Code offers a structured alternative: VCode [@lin2025vcode] reconstructs images as SVG programs preserving symbolic semantics over pixel fidelity, Code2Video [@chen2025code2video] shows that executable Manim scripts outperform pixel-generation models on structured content by making every spatial and temporal element directly addressable, and VIGA [@yin2026vision] extends the paradigm to 3D by reconstructing scenes and simulating physical interactions through generated Blender code. The open problem is unifying these code-based representations into a single world-model interface for both 2D and 3D compositional editing.

#### Simulation Fidelity and Intervention.

1.  **Partially observable software as a POMDP.** No existing code world model maintains belief distributions over hidden backend state (server sessions, database rows, in-flight requests, and background processes), nor reasons about asynchronous transitions with variable latency. Injecting realistic asynchronous failures into standard benchmarks causes significant drops in task success across all state-of-the-art agents [@xu2025warex]. Solving this requires temporal-belief architectures that jointly model what has happened, what is in progress, and what the agent cannot yet observe.

2.  **Concurrent multi-user state.** Real software is multi-tenant; world models must predict state under a Dec-POMDP where concurrent users' actions are unobservable [@wang2025decpomdp]. Conflict-free replicated data types [@kleppmann2017crdt] provide the formal substrate for merging concurrent updates, but no current world model integrates distributed-systems semantics with learned belief tracking over hidden users and pending writes. This problem sits at the intersection of the Digital World and Social World regimes, requiring joint reasoning about software state and multi-agent intent.

3.  **Agent-human behavioral alignment at scale.** LLM agents exhibit systematic biases toward moderation and consensus [@taubenfeld2024biases; @chuang2024opinion], producing two failure modes: mode collapse, where diverse simulated populations converge to homogeneous behavior, and calibration inadequacy, where single-turn persona alignment fails under multi-turn dynamics. Language and cultural priors can inject diversity, but this effect diminishes as cultural distance between populations shrinks. Systematic methods for grounding simulated behavior in real human behavioral distributions are lacking.
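
The belief-tracking requirement in the first item of this list can be illustrated with a discrete Bayes filter. The following is a minimal sketch, assuming a toy hidden backend state with three values and hand-written transition and observation tables; every state name and probability is hypothetical and is not taken from any cited system.

```python
from typing import Dict

Belief = Dict[str, float]

# Hypothetical hidden backend states for a single pending write.
STATES = ("idle", "in_flight", "committed")

# Hypothetical transition model P(next | current) after the agent clicks "submit".
TRANSITION: Dict[str, Dict[str, float]] = {
    "idle":      {"idle": 0.1, "in_flight": 0.8, "committed": 0.1},
    "in_flight": {"idle": 0.0, "in_flight": 0.5, "committed": 0.5},
    "committed": {"idle": 0.0, "in_flight": 0.0, "committed": 1.0},
}

# Hypothetical observation model P(ui_observation | state).
OBSERVATION: Dict[str, Dict[str, float]] = {
    "idle":      {"spinner": 0.05, "success_banner": 0.05, "nothing": 0.9},
    "in_flight": {"spinner": 0.7,  "success_banner": 0.0,  "nothing": 0.3},
    "committed": {"spinner": 0.1,  "success_banner": 0.6,  "nothing": 0.3},
}


def predict(belief: Belief) -> Belief:
    """Prediction step: push the belief through the transition model."""
    return {
        nxt: sum(belief[cur] * TRANSITION[cur][nxt] for cur in STATES)
        for nxt in STATES
    }


def correct(belief: Belief, observation: str) -> Belief:
    """Correction step: reweight by the observation likelihood and renormalize."""
    unnorm = {s: belief[s] * OBSERVATION[s][observation] for s in STATES}
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("observation has zero likelihood under the current belief")
    return {s: p / total for s, p in unnorm.items()}


def belief_update(belief: Belief, observation: str) -> Belief:
    """One filter step: predict the hidden transition, then condition on the UI."""
    return correct(predict(belief), observation)


prior = {"idle": 1.0, "in_flight": 0.0, "committed": 0.0}
after_spinner = belief_update(prior, "spinner")          # the UI shows a spinner
after_nothing = belief_update(after_spinner, "nothing")  # later, no visible change
```

In this toy setting the agent never observes the backend directly: after an uninformative observation the belief stays split between `in_flight` and `committed`, which is the kind of uncertainty a deterministic state-machine model would discard.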

#### Evidence-Driven Revision and Self-Evolution.

Existing autonomous science flagships such as CAMEO [@kusne2020cameo] and A-Lab [@szymanski2023alab] demonstrate that closed-loop model revision is feasible in highly instrumented domains, while evaluator-guided algorithmic discovery systems such as FunSearch [@romera2024funsearch] and AlphaEvolve [@alphaevolve2025] demonstrate partial L3 loops with strong validation (Section `\ref{sec:l3}`{=latex}). Several open problems must be solved before L3 generalizes.

1.  **Continual learning of societal transition functions.** Large-scale simulations with 10,000+ agents across millions of interactions reproduce emergent phenomena such as opinion polarization and governance formation [@piao2025agentsociety; @dai2024artificialleviathan], yet cannot autonomously detect when social dynamics have shifted. The core challenge is to identify an outdated transition model, acquire corrective evidence, and revise without catastrophic forgetting of stable patterns [@vandeven2024continuallearning]. This problem connects to L3, where model revision must be triggered by distributional evidence rather than supervised labels.

2.  **Closing the surrogate-to-reality gap.** Scientific surrogates validated on simulation data degrade on real measurements; prediction error decreases as a power law with the volume of computational data but plateaus without real-data calibration [@minami2025simtorealsciml]. This surrogate-to-reality gap is the scientific-regime analogue of sim-to-real transfer in robotics, so the implementation-side mitigations discussed in Section `\ref{subsec:impl_tradeoffs}`{=latex} provide a useful template, even though the measurement bottlenecks and evidence budgets differ. Notably, L2 scientific simulators such as GraphCast, NeuralGCM, and Aurora (Section `\ref{sec:l2}`{=latex}) provide the prediction substrate on which L3 revision operates; their fidelity sets the ceiling for downstream evidence-driven diagnosis. The OPAL-surrogate framework [@singh2024opal] provides hierarchical Bayesian credibility gates that formalize when a surrogate is trustworthy. The central open question is how to allocate scarce real experimental observations optimally between model calibration and scientific discovery.

3.  **Modeling laws that themselves evolve.** In biology, ecology, and climate, governing dynamics are non-stationary: viral fitness landscapes shift [@lassig2017predicting], climate forcing alters atmospheric dynamics [@beucler2024climate], and evolutionary pressures create tipping points [@evangelou2024coevolving]. World models must learn second-order meta-transition operators governing how $p_\theta$ itself drifts, together with revision triggers that detect law change from observational evidence. Causal discovery under non-stationarity [@huang2020cdnod; @song2023nctrl] provides identifiability results but treats change as variation within a fixed meta-model rather than as structural law replacement.

4.  **Harness designs for agentic world modeling.** Agent performance has evolved through three successive abstractions: prompt engineering optimizes what the model is told, context engineering [@anthropic2025context] curates the information state across turns, and harness engineering designs the executable environment surrounding the model (tools, memory, feedback loops, and inter-agent topology) [@rajasekaran2026harness; @pan2026nlah]. This progression implies that agent behavior is governed not by the model alone but also by the transition dynamics of its execution environment, making harness design a form of world modeling for software agents. The problem is how to learn and synthesize harnesses from interaction data, treating the execution environment itself as the object of modeling rather than a fixed engineering assumption.
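
Several items in this list call for a revision trigger that fires from distributional evidence rather than supervised labels. As a minimal sketch, the Page-Hinkley change-detection test can be run over a stream of one-step prediction errors. Page-Hinkley is a standard sequential test used here purely for illustration, not the method of any cited work; the class name, parameter values, and error stream are hypothetical.

```python
from typing import Iterable, Optional


class PageHinkley:
    """Sequential test for an upward shift in the mean of a scalar stream."""

    def __init__(self, delta: float = 0.05, threshold: float = 5.0) -> None:
        self.delta = delta          # tolerated drift magnitude
        self.threshold = threshold  # alarm level for the cumulative statistic
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.cum_min = 0.0

    def update(self, error: float) -> bool:
        """Consume one prediction error; return True when a shift is flagged."""
        self.n += 1
        self.mean += (error - self.mean) / self.n   # running mean of the stream
        self.cum += error - self.mean - self.delta  # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


def first_alarm(errors: Iterable[float], detector: PageHinkley) -> Optional[int]:
    """Return the zero-based index of the first alarm, or None if none fires."""
    for index, error in enumerate(errors):
        if detector.update(error):
            return index
    return None


# Hypothetical error stream: stable dynamics for 200 steps, then a regime change.
errors = [0.1] * 200 + [1.0] * 50
alarm_index = first_alarm(errors, PageHinkley(delta=0.05, threshold=5.0))
```

A detector of this kind only signals that the error distribution has moved. Deciding whether the cause is parameter drift or structural law replacement, and what evidence to acquire next, remains the open part of the problem.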

#### Cross-regime shared challenges.

Despite the diversity of governing-law regimes, three open problems recur across all four domains and constitute the deepest bottlenecks for world modeling in agentic AI. *Deployment shift*: world models trained on offline data or simulation systematically underperform when the environment drifts. UI layouts change, physical contact properties shift, social norms evolve, and scientific instruments recalibrate. Robust world modeling requires online mechanisms that detect distribution shift early and trigger targeted revision rather than waiting for catastrophic failure. *Constraint enforcement*: all four regimes have governing laws that valid trajectories must satisfy (contact stability, state-machine consistency, norm compliance, evidence-chain validity), yet current models enforce these constraints only softly through training objectives; hard enforcement at inference time, via symbolic layers, constrained rollout, or verification gates, remains an open architectural problem. *Persistent update governance*: L3 systems that revise themselves from evidence face a trilemma of stability (avoid regressing on past capabilities), plasticity (incorporate new evidence quickly), and auditability (trace every update to its evidence source); no current system resolves all three, and the governance infrastructure (versioning, canary deployment, rollback policies, and regression harnesses) is underspecified in most published architectures. The MREP framework proposed in Section `\ref{sec:evaluation}`{=latex} offers a starting point for standardizing evaluation across these shared challenges by providing version-locked, reproducible evaluation packages that make cross-regime comparison tractable.
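
One of the inference-time options named above, a verification gate inside the rollout loop, can be sketched as follows. This is a toy illustration under stated assumptions: the one-step model, the single non-interpenetration constraint, and the clamp-to-surface repair are all hypothetical stand-ins, and a real system would need regime-specific checkers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class State:
    height: float    # metres above the table surface
    velocity: float  # metres per second, positive is upward


def toy_learned_step(state: State, thrust: float, dt: float = 0.1) -> State:
    """Stand-in for a learned one-step model: it knows gravity but not the table."""
    velocity = state.velocity + (thrust - 9.81) * dt
    return State(state.height + velocity * dt, velocity)


def above_table(state: State) -> bool:
    """Hard constraint: the object must not pass through the table."""
    return state.height >= 0.0


def project_onto_table(state: State) -> State:
    """Hypothetical repair: clamp to the surface and cancel downward motion."""
    return State(max(state.height, 0.0), max(state.velocity, 0.0))


def gated_rollout(
    state: State,
    actions: Sequence[float],
    step: Callable[[State, float], State],
    constraints: Sequence[Callable[[State], bool]],
    repair: Optional[Callable[[State], State]] = None,
) -> Tuple[List[State], int]:
    """Roll out `step`, checking every candidate state against the constraints.

    A violating candidate is repaired if `repair` is given; otherwise the
    rollout is truncated at the last valid state.
    """
    trajectory, violations = [state], 0
    for action in actions:
        candidate = step(state, action)
        if not all(check(candidate) for check in constraints):
            violations += 1
            if repair is None:
                break
            candidate = repair(candidate)
            if not all(check(candidate) for check in constraints):
                break  # the repair failed, so stop rather than emit an invalid state
        state = candidate
        trajectory.append(state)
    return trajectory, violations


start = State(height=0.5, velocity=0.0)
no_thrust = [0.0] * 10
truncated, _ = gated_rollout(start, no_thrust, toy_learned_step, [above_table])
repaired, n_violations = gated_rollout(
    start, no_thrust, toy_learned_step, [above_table], repair=project_onto_table
)
```

The gate guarantees that no emitted state violates the constraint, but the violation count also shows how often the underlying model would have been wrong, which is useful as a diagnostic signal in its own right.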

Beyond L3 {#sec:trends:beyond}
---------

The L1$\to$L2$\to$L3 hierarchy (Sections `\ref{sec:l1}`{=latex}--`\ref{sec:l3}`{=latex}) assumes that the world operates under a fixed set of governing laws: L1 learns local regularities, L2 composes them into constraint-consistent rollouts, and L3 revises the model when evidence contradicts predictions. At L3, the system can update the laws governing its world model, but these updates remain grounded in explaining a single underlying reality.

A natural extension is **meta-world modeling**: systems that reason not only about a particular transition function, but about the space of possible transition functions itself. Rather than refining a model of the observed world, such systems would explore alternative rule systems that define different possible environments, for example by varying, extending, or constructing new assumptions, constraints, or governing principles.

As discussed in Section `\ref{sec:trends:representation}`{=latex}, the ability to explicitly represent and manipulate such principles becomes increasingly important in this setting. In particular, symbolic representations may provide a more natural interface for meta-world modeling, as they allow governing rules to be directly modified, composed, and compared across alternative worlds.

What forms this capability might take, whether through program synthesis, open-ended evolution, procedural world generation, or other mechanisms, remains an open question. More broadly, this raises the question of whether the endpoint of world modeling lies in increasingly accurate models of a world, or in systems that can systematically explore and reason over multiple worlds defined by different governing laws. We expect different research communities to arrive at different formulations of this capability, depending on whether the focus lies on predictive performance, scientific understanding, or generative modeling, and we leave the question of what constitutes the ultimate world model as an open invitation to the broader community.

Conclusion {#sec:conclusion}
==========

This paper has proposed a capability-based taxonomy for world modeling organized along two axes: three capability levels (Predictor, Simulator, Evolver) and four governing-law regimes (physical, digital, social, scientific).

An **L1 Predictor** learns local transition operators $p_\theta(z_t \mid z_{t-1}, a_t)$ whose quality is measured by one-step calibration, robustness, and identifiability (Section `\ref{sec:l1}`{=latex}). An **L2 Simulator** composes these operators into long-horizon, action-conditioned rollouts that must satisfy three boundary conditions (long-horizon coherence, intervention sensitivity, and constraint consistency) under the governing laws of the target domain (Section `\ref{sec:l2}`{=latex}). An **L3 Evolver** closes the loop by autonomously designing experiments, collecting evidence, and revising its dynamics model when predictions fail (Section `\ref{sec:l3}`{=latex}). Current systems such as CAMEO [@kusne2020cameo] and A-Lab [@szymanski2023alab] in autonomous science provide the strongest evidence that closed-loop model revision is already feasible in well-instrumented domains, while evaluator-guided algorithmic discovery systems such as FunSearch [@romera2024funsearch] and AlphaEvolve [@alphaevolve2025] show how automated scoring and regression gates can support partial L3 loops.

Organizing domains by **governing-law** regime rather than by modality has revealed both shared principles and irreducible differences (Section `\ref{sec:l2}`{=latex}). *Physical*-world systems benefit from geometric and conservation-law priors; *digital*-world systems exploit deterministic program semantics; *social*-world systems demand Theory-of-Mind representations; and *scientific*-world systems couple models to experimental evidence streams. The L1$\to$L2$\to$L3 taxonomy applies uniformly across these regimes, but the content of each level (what constitutes a valid rollout, what counts as a law violation, what evidence is available) varies fundamentally.

To make capability claims testable, we proposed decision-centric evaluation principles and a minimal reproducible evaluation package (Section `\ref{sec:evaluation}`{=latex}), and provided an architectural roadmap mapping representation, dynamics, and control choices to each capability level and deployment regime (Section `\ref{sec:implementation}`{=latex}). The open problems in Section `\ref{sec:trends}`{=latex} trace a research agenda: causal representation learning at L1, law-consistent rollout and compositional generalization at L2, and safe autonomous experiment design and model revision at L3.

A cross-cutting theme emerging from this survey is the question of representation substrate. The L1$\to$L2$\to$L3 progression describes what a world model can do, but leaves open what form it should take. Latent continuous representations have proven indispensable for learning transition operators at scale, yet the history of scientific discovery suggests that law revision, the hallmark of L3, has typically relied on symbolic substrates: Newton's laws, Maxwell's equations, and the Standard Model are all world models whose governing principles are explicit, composable, and directly revisable. Current neural world models encode invariances implicitly through architecture and training, which suits L1 and L2 but becomes a liability at L3, where the task is to revise model structure itself. We therefore view the development of world models that can discover and manipulate symbolic governing laws from data, rather than merely absorbing them into latent representations, as one of the most important open problems in the field. How to extend this to physical, digital, social, and scientific worlds remains a fundamental challenge.

```{=latex}
\begin{takeaway}[Key Takeaway]The future of agentic AI lies not in larger predictors, but in models that internalize the governing laws of the world, simulate its dynamics, and continuously evolve themselves through active trial-and-error loops, enabling them to navigate, interpret, and ultimately reshape the world.
\end{takeaway}
```
```{=latex}
\clearpage
```
```{=latex}
\small
```
```{=latex}
\bibliographystyle{abbrvnat}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Philosophical Motivations for Hierarchical World Modeling {#app:philosophy_extended}
=========================================================

This appendix expands the philosophical motivations behind the L1/L2/L3 hierarchy of Section `\ref{sec:philosophy}`{=latex}.

L1: Inductive Priors and the No-Free-Lunch Theorem
--------------------------------------------------

The i.i.d. premise in supervised learning mirrors Hume's uniformity-of-nature assumption: without it, induction has no logical warrant. @wolpert1996nofreelunch formalizes a complementary point: absent structural constraints, no learner is universally better than any other. Architectures and training curricula therefore serve as *inductive priors* analogous to what @kant1781critique called *synthetic a priori* structures. In practice, mechanisms such as convolutions, equivariances, attention patterns, and JEPA-style prediction targets [@lecun2022path] constrain which regularities are even expressible. LeCun's JEPA-centric architecture (perception, world model, actor, cost, short-term memory) can be viewed as one particular choice of such priors; our taxonomy classifies the **capabilities** these design choices enable, regardless of module count or naming convention.

Cognitive science supports this perspective. The predictive coding framework [@rao1999predictive] posits that the brain continuously generates top-down predictions of incoming sensory signals and updates its internal model by minimizing prediction error. Friston's active inference framework [@friston2010free] unifies perception, planning, and exploration under variational free energy, blurring the boundary between prediction and action. The \`\`Bayesian brain" hypothesis [@clark2015surfing] holds that perception itself is a form of probabilistic inference over latent causes of sensory input, suggesting that one-step latent forecasting is a primitive from which richer world-modeling capabilities emerge [@lake2017building].

Contemporary systems that embody L1's Humean empiricism include Dreamer-style latent prediction [@hafner2019dreamer; @hafner2020dreamerv2; @hafner2023dreamerv3] and large sequence models for short-horizon forecasting. These systems extract temporal regularities from trajectories and assume that those regularities persist, staking next-step accuracy on that assumption.

L2: Modal Semantics, Epistemic Drift, and Active Inference
----------------------------------------------------------

Modal semantics (\`\`possible worlds," \`\`closest" counterfactuals) [@kripke1963semantic; @kripke1980naming; @stalnaker1968conditionals] supplies helpful **vocabulary** (latent states index alternative futures; actions carve navigable branches), but the core engineering content is interventional rather than purely modal.

Lewis's theory of \`\`closest possible worlds" [@lewis1973counterfactuals] provides an operational core of counterfactual reasoning: effective reasoning does not involve exploring arbitrary possibilities but analyzing worlds that are maximally similar to our own, where a minimal intervention yields a coherent trajectory. This \`\`near-factual" heuristic lets L2 simulators remain tractable while still supporting intervention-structured imagination.

MuZero's search over learned dynamics [@schrittwieser2020muzero] is a concrete instance: it uses Monte Carlo Tree Search to explore action sequences in a learned model, demonstrating how Lewis's \`\`closest possible worlds" theory translates into practical AI systems that enable counterfactual reasoning for decision-making.

Plato's cave offers a diagnostic image (shadows projected from an incomplete generator [@plato1992republic]) as a reminder that visual fluency does not guarantee fidelity. An L2 simulator that excels at predicting shadows on a wall may remain fundamentally bounded by the wall's dimensions, unable to access the fire that casts those shadows. This metaphor captures **epistemic drift**: internally coherent trajectories that leave the training manifold. When the data distribution used to train the simulator does not align with the true \`\`fire" of reality, the L2 agent becomes trapped within its own simulated shadows. Its \`\`modal stability" thus becomes its greatest liability: the more a model relies on its internal assumptions to fill in the gaps of missing data, the more brittle it becomes when confronted with the irreducible complexity of the real world.

Friston's active inference framework [@friston2017active] addresses a similar concern from the opposite direction: by unifying perception, planning, and exploration under variational free energy, it blurs the L1/L2 boundary into a continuum. Our stage boundary remains useful as an engineering diagnostic even when the theory is continuous: the question is whether rollouts are reliable enough for downstream use.

L3: Falsifiability, Paradigm Shifts, and Abduction
--------------------------------------------------

Popper's emphasis on **risky, falsifiable predictions** [@popper1959logic] resonates with governed L3 loops: proposed revisions should yield measurable improvements on held-out probes, regression suites, or experimental outcomes, not merely post-hoc accommodation. @kuhn1962structure's \`\`paradigm shift" provides a useful **contrast**: Kuhn stresses non-cumulative revolutionary breaks, whereas most engineered L3 systems today perform **incremental** scaffold updates, closer to Lakatos's progressive problem-shifts than to Kuhnian revolution.

Peircean **abduction** (inference to the best explanation) [@peirce1931collected] loosely motivates the hypothesis-generation step when monitors flag anomaly, but the analogy should not be pressed too far: contemporary systems typically search within *structured* spaces (program sketches, simulator hooks, experiment templates) [@lu2024aiscientist; @yamada2025aiscientistv2] rather than inventing wholly new ontologies.

Lakatos's methodology of scientific research programmes [@lakatos1978methodology] offers a precise framework for understanding model revision. Systems have a \`\`hard core" of fundamental principles (the model's architecture and inductive biases) and a \`\`protective belt" of auxiliary hypotheses (the learned parameters). In conventional training (L1 and L2), errors are absorbed by the protective belt via gradient descent. When persistent structured errors occur, however, an L3 system identifies that the crisis lies within the hard core itself and performs what Kuhn called a \`\`paradigm shift", reorganizing its internal ontology.

The Duhem--Quine thesis of confirmation holism [@duhem1954aim; @quine1951two] explains why blame-assignment is non-trivial: errors redistribute across modules until diagnostics isolate the brittle component. This holism complicates the transition from parameter adjustment to structural revision, making evidence-driven diagnosis via held-out probes and targeted ablations a central capability for L3 systems.

Moerland et al.'s model-based RL taxonomy [@moerland2023mbrl] organizes methods along axes such as how dynamics integrate with policy optimization. Those axes are **orthogonal complements** to ours: two algorithms with identical integration strategy can sit at different capability levels depending on simulation depth and whether the model structure itself changes under evidence.

Conceptual Boundaries {#app:boundaries}
=====================

World Modeling versus Generic Prediction
----------------------------------------

Unlike standard machine learning prediction (e.g., classification, recommendation), world modeling targets *stateful dynamics*: how environments evolve over time under actions or interventions. The three L2 boundary conditions defined above (long-horizon coherence, intervention sensitivity, and constraint consistency) already capture the core of this distinction at the simulation level. A fourth capability further separates world modeling but is **orthogonal** to the L1/L2/L3 hierarchy:

-   **Closed-loop use:** supporting planning, acting, and self-improvement through interaction with the modeled environment. This capability is essential for embodied agents but is **not** part of the L1/L2/L3 definition: a weather emulator or video generator can be an L2 world model with no embedded planner. Conversely, strong closed-loop performance additionally requires exploration, reward specification, safety, and search, whose failure modes are not always traceable to dynamics error alone [@kaelbling1998pomdp; @schrittwieser2020muzero].

A useful practical distinction is that world models are organized around action-conditioned transition queries: by conditioning on actions, they compress the simulation problem into decision-relevant futures rather than attempting to model all observable variation indiscriminately.

World Model versus Planner {#subsubsec:wm_vs_planner}
--------------------------

The *world model* is **descriptive**: it approximates how states and observations evolve under actions or interventions. The *planner* is **normative**: it chooses actions to optimize an objective given those predictions [@kaelbling1998pomdp; @puterman1994mdp]. Conflating the two obscures where failures originate (wrong dynamics vs. wrong objectives or search) and blocks reuse of one model with many planners.

The world model corresponds to $T$, $O$, and learned $(q_\phi, p_\theta, p_\psi)$; the planner is $\pi$, a value function, or a search procedure consuming rollouts (MCTS, imagined trajectories, etc.) [@schrittwieser2020muzero; @hafner2023dreamerv3]. Different planners can sit atop the same dynamics; different dynamics can be swapped into the same planner for ablation. Forecasting simulators and video models illustrate world models without planners [@brooks2024sora; @ding2024survey_wm]; model-based RL typically *co-trains* dynamics and policy (Dreamer-style), yet the **roles** remain distinct [@hafner2019dreamer; @hafner2023dreamerv3].

The planner issues *queries* (one-step forward, multi-step rollout, counterfactual edits); the world model answers them. L1/L2/L3 classify **query depth and reliability**, not the planner's existence.
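
The query-based separation described above can be made concrete with a small interface sketch. It assumes a scalar state, a deterministic toy dynamics class, and a brute-force planner; the names `WorldModel`, `LinearModel`, `rollout`, `counterfactual`, and `exhaustive_planner` are hypothetical and chosen only for illustration.

```python
import itertools
from typing import List, Protocol, Sequence


class WorldModel(Protocol):
    """Descriptive role: answers one-step transition queries."""

    def step(self, state: float, action: float) -> float: ...


class LinearModel:
    """Toy dynamics in which the state moves by `gain * action`."""

    def __init__(self, gain: float) -> None:
        self.gain = gain

    def step(self, state: float, action: float) -> float:
        return state + self.gain * action


def rollout(model: WorldModel, state: float, actions: Sequence[float]) -> List[float]:
    """Multi-step query, built by composing one-step queries."""
    states = []
    for action in actions:
        state = model.step(state, action)
        states.append(state)
    return states


def counterfactual(
    model: WorldModel, state: float, actions: Sequence[float], index: int, new_action: float
) -> List[float]:
    """Counterfactual query: replay the same plan with one action edited."""
    edited = list(actions)
    edited[index] = new_action
    return rollout(model, state, edited)


def exhaustive_planner(
    model: WorldModel,
    state: float,
    goal: float,
    horizon: int = 3,
    action_set: Sequence[float] = (-1.0, 0.0, 1.0),
) -> List[float]:
    """Normative role: choose the plan whose predicted end state is nearest the goal."""
    best = min(
        itertools.product(action_set, repeat=horizon),
        key=lambda plan: abs(rollout(model, state, plan)[-1] - goal),
    )
    return list(best)


# The same planner consumes two different dynamics models without modification.
plan_forward = exhaustive_planner(LinearModel(gain=1.0), state=0.0, goal=2.0)
plan_mirrored = exhaustive_planner(LinearModel(gain=-1.0), state=0.0, goal=2.0)
```

Because the planner touches the model only through `step`, swapping dynamics for ablation requires no change to the planner, and a failure can be attributed to either the predicted states or the objective and search.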

World Modeling versus Commonsense Modeling
------------------------------------------

World modeling targets **how** states evolve; commonsense supplies **what usually persists, what is relevant, and what must not break** across transitions [@johnson1983mental; @lecun2022path; @marcus2018deep]. However, neither alone suffices for a comprehensive model of the real world: fluent rollouts can violate invariants, while static \`\`facts" without dynamics cannot support control. For example, a world model may generate a visually smooth rollout of a cup falling, yet without commonsense or physical invariants it may let the cup pass through a solid table; conversely, knowing that solid objects do not interpenetrate does not by itself predict the exact trajectory after a push.

World modeling supports predictions $z_{t-1} \rightarrow z_t$, $(z_{t-1},a_t)\rightarrow z_t$, and $z_t\rightarrow o_t$, including L1 operators and L2 trajectories. Commonsense encodes persistence, default invariants, and normative structure, classically linked to the **frame problem** (specifying what does *not* change) [@mccarthy1969some]. In L2, **constraint consistency** operationalizes a **testable subset** of commonsense: violations are measurable against explicit rules or domain simulators, even though full commonsense remains open-ended.
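
In this notation, an L2 trajectory query can be read as a composition of L1 operators. The factorization below is a sketch under assumptions that the text does not state explicitly: the latent state is Markov given the action, $p_\psi$ denotes the observation model for $z_t \rightarrow o_t$, and the horizon $H$ is introduced here only for illustration.

$$
p\left(z_{t+1:t+H},\, o_{t+1:t+H} \mid z_t,\, a_{t+1:t+H}\right)
= \prod_{k=1}^{H} p_\theta\left(z_{t+k} \mid z_{t+k-1},\, a_{t+k}\right)\, p_\psi\left(o_{t+k} \mid z_{t+k}\right)
$$

Under this reading, constraint consistency is a property of the sampled trajectory as a whole: each factor can be well calibrated in isolation while the product still places mass on states that violate an explicit rule.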

Taxonomy Comparison {#subsec:taxonomy_comparison}
-------------------

Our L1/L2/L3 hierarchy is not the only capability decomposition for intelligent systems. We situate it against four influential frameworks below. Correspondences are *partial and analogical*, not one-to-one: each comparison framework was designed with a different scope and computational commitment, and the levels do not map cleanly onto L1/L2/L3 in every case. The key distinction is that our taxonomy is *architecture-agnostic* and *cross-domain*: it applies uniformly across physical, digital, social, and scientific regimes.

1.  **Pearl's causal hierarchy** [@pearl2009causality]: L1 $\approx$ Association; L2 $\approx$ Intervention. The counterfactual rung reasons within a fixed model, whereas L3 revises the model itself.

2.  **LeCun's autonomous-machine architecture** [@lecun2022path]: L1 $\approx$ Reactive; L2 $\approx$ Planning via world model. This framework lacks an explicit model-revision stage comparable to L3.

3.  **Friston's active inference** [@friston2010free]: L1 $\approx$ Perception; L2 $\approx$ Planning. The framework provides a unified Bayesian principle but includes no discrete revision stage comparable to L3.

4.  **Moerland et al.'s MBRL survey** [@moerland2023mbrl]: L1 $\approx$ Model learning; L2 $\approx$ Planning with learned model. Organizes by integration strategy rather than capability level, and is RL-specific.

Detailed L2 Systems by Different Domains {#app:l2_extended}
========================================

This appendix provides extended details for methods summarized in Section `\ref{sec:l2}`{=latex}.

Embodied and 3D Systems
-----------------------

Two recurring dimensions help compare these systems: the **representation carrier** (occupancy, point/particle, Gaussian, or asset-based scene state) and the **degree of action-coupling**, ranging from passive continuation to action-conditioned forecasting and simulator-ready planning support.

#### 3D-structured world models.

A visible trend in learned physical-world modeling is the shift from appearance-first continuation toward geometry-carrying simulation. Rather than merely extending pixels frame by frame, these systems increasingly maintain explicit or semi-explicit 3D scene structure that can support action-conditioned forecasting, collision reasoning, and planning.

*Industrial previews.* As illustrative industrial signals rather than peer-reviewed systems, World Labs' 2025 previews point in this direction: RTFM emphasizes persistent real-time interaction, while Marble lifts text, images, and coarse 3D layouts into explorable 3D worlds [@worldlabs2025rtfm; @worldlabs2025marble].

*Geometry and volumetric forecasting.* Among academic systems, Aether couples reconstruction, action-conditioned prediction, and visual planning in a geometry-aware framework [@zhu2025aether], while TesserAct formulates embodied forecasting as learning 4D scenes with spatial and temporal consistency [@zhen2025tesseract]. RoboOccWorld makes geometric commitment explicit at the volumetric level by forecasting scene evolution directly in occupied 3D space, so that whether a structure is occupied becomes a first-class output for collision checking, motion planning, and multi-step spatial reasoning [@zhang2025robooccworld].

*Fine-grained physical representations.* Other work moves toward more detailed simulation substrates. GAF represents robotic interaction with 4D Gaussian fields [@chai2025gaf], ParticleFormer models point-cloud dynamics for multi-material manipulation [@huang2025particleformer], and GWM treats Gaussian primitive propagation as both a neural simulator and a representation-learning substrate [@lu2025gwm]. LiDARCrafter [@liang2026lidarcrafter], DynamicCity [@bian2025dynamiccity], and U4D [@xu2025u4d] simulate 4D worlds from native 3D representations, such as point clouds and occupancy grids. PointWorld further unifies state and action as 3D point flows, targeting cross-embodiment manipulation [@huang2026pointworld]. Taken together, these methods suggest a shift from generic video continuation toward representations that expose finer operational structure for contact-rich, action-sensitive dynamics.

*Simulation-ready extensions.* A related direction focuses less on forecasting alone and more on producing assets and views that can directly support downstream interaction. PhysX-Anything generates articulated, physical 3D assets from a single in-the-wild image for direct use in simulation [@cao2025physxanything], while MVISTA-4D extends toward single-view-to-arbitrary-view RGBD imagination with test-time action optimization [@wang2026mvista4d]. Across these systems, the world model becomes less like a passive renderer and more like a spatially queryable latent scene machine. Broadly, occupancy-based methods favor global free-space reasoning and collision checking but remain relatively coarse; point- and particle-based methods better capture contact-rich local dynamics but are harder to scale; Gaussian-style representations improve visual and spatial fidelity but often require additional structure for strict physical interaction.

#### Autonomous driving world models.

Autonomous driving is a particularly clear L2 setting because useful rollouts must jointly preserve geometric accuracy (lane structure, free space), dynamic consistency (vehicle kinematics, traffic flow), and counterfactual sensitivity: if the ego vehicle brakes earlier or changes lanes, surrounding trajectories and occupancy should update coherently rather than merely continuing the same scene [@survey_vla4ad; @wang2025alpamayo; @worldlens]. Earlier systems like GAIA-1 [@hu2023gaia1] and DriveWorld [@min2024driveworld] established scene generation conditioned on control signals.

Subsequent work has branched along two main axes. Along the *representation* axis, Copilot4D [@zhang2024copilot4d] introduced unsupervised 4D modeling via discrete diffusion on LiDAR point clouds, OccWorld [@zheng2024occworld] moved to 3D occupancy with a GPT-like spatial-temporal transformer, and Hermes [@zhou2025hermes] unified BEV scene understanding with future generation. Along the *fidelity--controllability* axis, VISTA [@gao2024vista] demonstrated 576$\times$1024 resolution at 10 Hz with 15-second coherent rollouts, while DriveDreamer [@wang2024drivedreamer] built world models entirely from naturalistic driving data using a diffusion backbone. AD-R1 [@yan2025ad-r1] builds the first closed-loop simulator by combining impartial world modeling with a rich curriculum of plausible collisions and off-road events.

A further line of work concerns *policy alignment under fine-tuning* rather than base representation alone: AdaWM [@zhao2025adawm] addresses representation degradation during RL fine-tuning via low-rank alignment that preserves pre-trained structure while adapting to new driving policies. This progression also marks a shift from open-loop scene generation to closed-loop control support, where actions are not merely conditioning variables but candidate interventions whose consequences must be compared before execution.

Software, Web, and Game Systems
-------------------------------

#### Game world models.

Game worlds occupy a distinctive position at the intersection of physical and digital intelligence: visual dynamics follow physics-like rules (rendering, object motion, collision), yet transitions are ultimately governed by deterministic game logic (score updates, level triggers, inventory changes). This overlap makes games a natural testbed for world models that must integrate perceptual prediction with rule-based reasoning. NitroGen [@nitrogen2025], NVIDIA's open vision-action foundation model trained on 40K hours of gameplay across 1000+ games, achieves 52% improvement on unseen games via large-scale behavior cloning. Earlier work at L1, including DIAMOND [@alonso2024diamond] and Genie [@bruce2024genie] (Section `\ref{sec:l1}`{=latex}), established frame-by-frame prediction; the L2 challenge is long-horizon, action-conditioned simulation respecting both visual dynamics and underlying game rules. GameNGen [@valevski2025gamengin] demonstrated that a diffusion model trained on DOOM gameplay can serve as a real-time neural game engine at 20 FPS, generating interactive frames that human raters find hard to distinguish from the original engine in short clips. Video2Game [@xia2024video2game] converts a single video into an interactive 3D game-like environment with real-time physics and rendering, bridging passive video understanding with interactive world simulation. Beyond games, in the software and web settings covered by this section, state includes DOM structure, focus, file system, and application state machines; evaluable tasks span OS [@xie2024osworld; @yang2025macosworld], web [@zhou2023webarena; @deng2023mind2web; @yao2022webshop], and software debugging workflows [@jimenez2023swebench; @yang2024sweagent; @shi2017worldofbits].

Social Simulation and Multi-Agent Systems
-----------------------------------------

#### ToM prompting and reasoning.

Structured prompting strategies suggest the bottleneck in social reasoning is reasoning structure rather than knowledge. SymbolicToM [@sclar2023symbolictom] constructs explicit per-character belief graphs after each story event, supporting up to third-order beliefs through graph traversal (ACL 2023 Outstanding Paper). SimToM [@wilf2024simtom] implements perspective-taking as a two-stage process inspired by Simulation Theory from cognitive science: first filtering context to what the target character knows, then answering from that filtered view. K-Level Reasoning [@zhang2025k] implements the behavioral economics Level-K framework recursively in LLMs for negotiation. Thought-Tracing [@kim2025hypothesis] implements approximate Bayesian inference via Sequential Monte Carlo-like hypothesis generation, significantly outperforming reasoning models like o3-mini, suggesting social reasoning may require fundamentally different computational mechanisms than mathematical deduction.

#### Sandbox architectures and scale.

Project Sid [@altera2024projectsid] deployed up to 1,000 agents across six towns in Minecraft using the PIANO architecture (Parallel Information Aggregation via Neural Orchestration), a brain-inspired modular design with separate concurrent modules for cognition, planning, motor execution, and speech. Emergent phenomena included autonomous professional specialization, personality-driven social network formation, democratic governance, and cultural transmission including spontaneous religious proselytization. Sotopia extensions include Sotopia-$\pi$ [@wang2024sotopia_pi] (interactive self-reinforcement learning for social skills) and Lifelong-Sotopia (multi-episode long-term consistency evaluation). AgentSociety [@piao2025agentsociety] simulated 10,000+ agents generating 5 million interactions in an integrated urban-social-economic environment with emotion and cognitive modeling inspired by Maslow's hierarchy. Deployed platforms such as Moltbook[^1] provide persistent social environments where AI agents autonomously post, discuss, and form community norms, bridging the gap between simulation and real-world agent societies.

#### Emergent social phenomena.

Only 2 of 15 LLMs achieve sustainable cooperation in commons dilemma scenarios [@piatti2024cooperateorcollapse], and cooperation evolution across generations of LLM agents proves strongly model-dependent [@vallinder2024culturalevolution]. Yet norms and conventions do emerge: @ren2024norms document norm formation in LLM societies, and @ashery2025conventions find social conventions with critical mass tipping points, where collective biases appear at the group level that do not exist in individual agents. Melting Pot [@leibo2021meltingpot] provides 50+ substrates covering cooperation, competition, deception, and coordination for systematic evaluation of such dynamics. Role-playing systems such as RoleLLM [@wang2024rolellm], CharacterLLM [@shao2023characterllm], and ChatHaruhi [@li2023chatharuhi] probe character-consistency through persona fine-tuning and memory-based maintenance. @shanahan2023roleplay argue that LLMs maintain implicit world models of character situations through distributional representations. Werewolf and Avalon serve as concentrated testbeds for deception and trust: comprehensive Avalon investigation [@lan2024avalonllm] documented emergent leadership and camouflage strategies, ReCon [@wang2024recon] introduced recursive perspective transitions for deception handling, and The Traitors [@curvo2025traitors] found that deceivers consistently prevail by exploiting the cognitive limitations of honest participants.

#### Digital twin societies.

S$^3$ [@gao2023s3] simulates information propagation, emotion contagion, and attitude polarization on social media platforms; an extended version successfully predicted 2024 US presidential election results, demonstrating predictive validity for real-world phenomena. SocioVerse [@zhang2025socioverse] validates social simulation against a pool of 10 million real-world users, enabling election prediction, breaking-news response, and economic survey replication at unprecedented scale. PersuasionForGood [@wang2019persuasionforgood] modeled persuasion as a social state transition process, tracking how 10 distinct strategies shift attitudes, establishing that social dynamics are personalized rather than universal.

#### Institutional and formal approaches.

As @dignum2026agentifying argue, current LLM-based agents exhibit behavioral autonomy without explicit reasoning structures. The BDI (Belief--Desire--Intention) architecture [@rao1995bdi], normative multi-agent systems [@boella2007norms], electronic institutions [@esteva2001electronic], and formal commitment models [@telang2023commitments] provide the missing machinery: explicit, inspectable representations of mental states, social obligations, and institutional roles. MetaGPT [@hong2024metagpt] encodes organizational knowledge through Standardized Operating Procedures, and ChatDev [@qian2024chatdev] implements chat-chain architectures with communicative dehallucination, both showing that explicit institutional constraints outperform individual agent prompting for organizational coherence. Strategic dialogue systems further test social dynamics: CraigslistBargain [@he2018craigslist] decoupled strategy from generation, NegotiationArena [@bianchi2024negotiationarena] quantifies irrational behaviors, the Consensus Game [@jacob2024consensus] formalizes LM decoding as equilibrium search, and the Game-theoretic LLM framework [@hua2024gametheoreticllm] incorporates backward induction into agent workflows.

AI-for-Science Systems
----------------------

#### Neural dynamics and interpretability.

DyNeMo [@gohil2022dynemo; @khan2023dynemoc] combines an encoder mapping observations to latent network modes with a memory model capturing their temporal evolution, forming a generative dynamical system. With this structure, DyNeMo supports forward simulation of future latent states and prediction of neural responses to external interventions [@helfrich2014entrainment; @ngo2013auditory] via in-silico simulation rather than direct experimentation. However, unlike physical systems where governing laws are well established, the dynamics of large-scale neural activity remain largely unknown, shifting the primary scientific objective toward interpretable mechanism discovery. DyNeMo facilitates this by learning structured and interpretable latent representations that capture spatial patterns of functional brain networks, whose temporal statistics reveal higher-level organizational principles including structured cycles in network activations [@van2025large]. This highlights a distinct role of scientific world models: not only simulating known dynamics, but discovering the state space and transition structure themselves through interpretable representations and their statistical regularities.

#### Operator learning and molecular surrogates.

The neural operator framework [@kovachki2023neuraloperator] provides a unified theoretical foundation for learning maps between infinite-dimensional function spaces, establishing approximation theory and error bounds that underpin FNO, DeepONet, and subsequent architectures. PINO [@li2021pino] combines the neural operator architecture with physics-informed PDE residual losses, enabling zero-shot super-resolution and improved generalization under sparse data. PI-DeepONet [@goswami2022pideepone] extends the DeepONet framework with physics-informed training, embedding governing PDE residuals directly into the operator learning objective. SchNet [@schutt2017schnet] introduced continuous-filter convolutions for molecular graphs, enabling end-to-end learning of quantum chemical properties without handcrafted features and serving as the architectural precursor to equivariant GNN potentials. For a comprehensive treatment of ML approaches to molecular simulation, including neural potentials, coarse-grained models, and generative sampling, see @noe2020mlmolsim. Boltzmann Generators [@noe2019boltzmann] pioneered deep generative models for sampling thermodynamic equilibrium states of molecular systems, bypassing the sequential bottleneck of traditional molecular dynamics. ClimaX [@nguyen2023climax] introduced the foundation-model paradigm for weather and climate, pretraining on CMIP6 climate-model simulation data with self-supervised learning and fine-tuning to both forecasting and climate projection tasks.

Illustrative L3 Evolution Loops {#app:l3_examples}
===============================

This appendix provides illustrative worked examples showing how the L3 evolution loop operates in each governing-law regime. These are constructive illustrations, not reports of deployed system behavior.

Dynamics Model Revision from Grasp Failure
------------------------------------------

A robot repeatedly drops an object despite the planner predicting success.

1.  *Anomaly*: force/torque sensor logs and contact timestamps reveal systematic deviation from predicted friction.

2.  *Attribution*: the dynamics model overestimates friction for smooth cylindrical objects, so it predicts a stable grasp at grip forces that are too low in practice.

3.  *Revision*: update the friction prior; add a pre-grasp diagnostic squeeze to estimate friction online; create a regression test with 5 material variants.

4.  *Validation*: replay across RoboCasa [@nasiriany2024robocasa] tasks to confirm the update does not degrade other material classes.

Strategy Revision in a Service Environment
------------------------------------------

A conversational agent in a simulated service environment consistently fails to retain users who threaten to cancel, despite the societal world model predicting that a discount would resolve the conversation.

1.  *Anomaly*: across 50 cancellation dialogues, the discount offer succeeds only 20% of the time, far below the model's predicted 70% retention rate.

2.  *Attribution*: the social model assumes users who threaten cancellation are price-sensitive, but conversation analysis reveals that most are frustrated by service quality, not price. The model's belief attribution is systematically wrong.

3.  *Revision*: update the user-intent classifier to distinguish price-sensitive from quality-frustrated users; for quality-frustrated users, replace the discount strategy with an acknowledgment-and-escalation strategy; add a regression test ensuring price-sensitive users still receive discounts.

4.  *Validation*: A/B test the revised strategy on the next 100 cancellation interactions, measuring retention rate, user satisfaction, and whether the revision introduces new failure modes (e.g., unnecessary escalations for genuinely price-sensitive users).
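The anomaly step above can be backed by a short arithmetic check. The sketch below asks how likely 10 or fewer successes in 50 dialogues (the 20% observed rate) would be if the model's 70% prediction were correct. Treating dialogues as independent Bernoulli trials, the helper name `binomial_lower_tail`, and the alert threshold are assumptions made here for illustration.

```python
from math import comb


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), assuming independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_dialogues = 50
observed_successes = 10   # 20% of 50 cancellation dialogues
predicted_rate = 0.70     # retention rate predicted by the social model

p_value = binomial_lower_tail(observed_successes, n_dialogues, predicted_rate)

# Hypothetical alert threshold; a deployed system would choose its own.
is_anomaly = p_value < 1e-3
print(f"P(X <= {observed_successes}) = {p_value:.3e}, anomaly = {is_anomaly}")
```

Under these assumptions the observed outcome is far outside what the predicted rate allows, which is what justifies moving on to attribution rather than treating the shortfall as noise.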

Installation Failure Producing Reusable Skill
---------------------------------------------

An agent runs \`\`install dependencies," encounters an error, and retries fruitlessly.

1.  *Anomaly*: error codes and permission status reveal a `permission denied` failure class.

2.  *Attribution*: insufficient privileges, not a typo or network error.

3.  *Revision*: create a skill template (on permission errors: check privileges $\rightarrow$ choose alternative install path $\rightarrow$ re-verify), plus a regression test in a restricted-permission environment.

4.  *Validation*: replay across OSWorld and macOSWorld variants [@xie2024osworld; @yang2025macosworld] to confirm that the extracted skill transfers beyond the original failure case.

Closed-Loop Materials Discovery
-------------------------------

A Bayesian surrogate predicts a specific crystal phase, but synthesis yields an unexpected mixed phase.

1.  *Anomaly*: XRD diffraction pattern deviates from the predicted single-phase output.

2.  *Attribution*: the surrogate underweights the effect of synthesis temperature on phase stability.

3.  *Revision*: update the Bayesian model with the new data point; expand the hypothesis space to include temperature-dependent phase boundaries; design a follow-up experiment at an intermediate temperature.

4.  *Validation*: the next synthesis confirms the revised prediction, and calibration improves on held-out compositions [@kusne2020cameo; @szymanski2023alab].
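The four loops above share the same final gate: a revision is kept only if it repairs the anomalous cases without degrading cases that previously worked. A minimal, regime-agnostic sketch of that gate follows. The model interface (a callable from a case identifier to a predicted quantity), the case names, and all numeric values are hypothetical placeholders.

```python
from typing import Callable, Dict

# Hypothetical interface: a model maps a test-case id to a predicted scalar.
Model = Callable[[str], float]


def mean_abs_error(model: Model, observed: Dict[str, float]) -> float:
    """Average absolute gap between predictions and observed outcomes."""
    return sum(abs(model(case) - y) for case, y in observed.items()) / len(observed)


def should_promote(old: Model, new: Model,
                   anomaly_cases: Dict[str, float],
                   regression_cases: Dict[str, float],
                   tol: float = 1e-9) -> bool:
    """Promote a revision only if it fixes the anomaly and regresses nowhere."""
    fixes_anomaly = mean_abs_error(new, anomaly_cases) < mean_abs_error(old, anomaly_cases)
    no_regression = all(
        abs(new(case) - y) <= abs(old(case) - y) + tol
        for case, y in regression_cases.items()
    )
    return fixes_anomaly and no_regression


# Toy data: one failing case and two cases the old model already handled.
anomaly = {"failing_case": 0.2}
regression = {"case_a": 0.5, "case_b": 0.55}


def old_model(case: str) -> float:
    return 0.5


def targeted_revision(case: str) -> float:
    return 0.2 if case == "failing_case" else 0.5


def blanket_revision(case: str) -> float:
    return 0.2


print(should_promote(old_model, targeted_revision, anomaly, regression))
print(should_promote(old_model, blanket_revision, anomaly, regression))
```

The targeted revision is promoted, while the blanket revision is rejected because it repairs the failing case at the expense of the regression cases; this is the role the regression tests play in each worked example.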

Additional Details of Evaluation and Benchmarks {#app:eval_extended}
===============================================

This appendix provides detailed evaluation protocols for the three L2 boundary conditions, along with world-model-specific benchmarks, capability coverage analysis, and the Minimal Reproducible Evaluation Package (MREP) introduced in Section `\ref{sec:evaluation}`{=latex} of the main text.

Long-Horizon Coherence
----------------------

Long-horizon coherence asks whether composed predictions $\hat p(\tau\mid z_0,a_{1:H},c)$ remain usable as the rollout horizon $H$ grows. The signature failure is **compounding error**: small per-step deviations amplify over time, pushing imagined trajectories into unreachable branches (Section `\ref{subsec:l2_failure_modes}`{=latex}, failure mode 1). A secondary failure is **state aliasing** (failure mode 2), where distinct real states collapse into similar representations, causing the rollout to silently diverge from reality.

Operationally, coherence is measured by tracking a task-relevant metric as a function of horizon. For physical manipulation, RoboCasa [@nasiriany2024robocasa] and ManiSkill3 [@tao2024maniskill3] offer multi-step tasks where success rate degrades predictably with horizon; the degradation curve itself is the diagnostic. SWE-bench [@jimenez2023swebench] poses an analogous challenge for code: multi-file resolution rate reveals whether the model maintains a coherent codebase state across $k$ interdependent editing steps. Social settings introduce a different flavour of drift. Sotopia [@zhou2024sotopia] tracks whether commitments, alliances, and relational variables remain stable across multi-turn interactions rather than silently eroding. Scientific reasoning demands another form of coherence: in ScienceWorld [@wang2022scienceworld], sequences of lab actions must preserve causal ordering (e.g., heating a substance before measuring its temperature, not after).

A significant gap in current practice is that most benchmarks report fixed-horizon success rates without degradation curves: a system is tested at horizon $H$ but the relationship between performance and $H$ is not characterized. Techniques from long-form video understanding, such as temporal search methods that selectively retrieve and reason over distant segments [@ye2025re], suggest a promising direction for diagnosing coherence at scale. Open-world environments provide the strongest existing tests of long-horizon coherence precisely because they demand skill composition across hundreds of steps. Minecraft tasks (via MCU [@zheng2023mcu] or Voyager [@wang2023voyager]) require gathering resources, crafting tools, and navigating terrain in sequences that can exceed a thousand actions. Crafter [@stanic2023learning] compresses similar compositional demands into a more controlled setting, while NetHack [@kurenkov2023katakomba] adds procedural generation and symbolic complexity. In model-based RL, MBPO-style horizon truncation [@janner2019mbpo] implicitly acknowledges coherence limits by restricting rollout length to a range where the model remains accurate, but this adaptive truncation is rarely evaluated as a coherence metric in its own right.
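A degradation curve of the kind argued for above can be computed from ordinary episode logs. The sketch below assumes each logged episode is a `(horizon, success)` pair; the function names and the 0.5 threshold used to summarize a \`\`usable horizon" are illustrative choices, not an established metric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def degradation_curve(episodes: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Success rate as a function of rollout horizon H."""
    counts = defaultdict(lambda: [0, 0])  # horizon -> [successes, trials]
    for horizon, success in episodes:
        counts[horizon][0] += int(success)
        counts[horizon][1] += 1
    return {h: s / n for h, (s, n) in sorted(counts.items())}


def usable_horizon(curve: Dict[int, float], threshold: float = 0.5) -> int:
    """Largest tested horizon before success first drops below the threshold (0 if none)."""
    usable = 0
    for h, rate in sorted(curve.items()):
        if rate < threshold:
            break
        usable = h
    return usable


# Hypothetical log: two episodes at each of three horizons.
log = [(5, True), (5, True), (10, True), (10, False), (20, False), (20, False)]
curve = degradation_curve(log)
print(curve, usable_horizon(curve))
```

Reporting the whole curve, rather than a single fixed-horizon number, makes it possible to compare systems that were tuned for different horizons.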

Intervention Sensitivity
------------------------

Intervention sensitivity asks whether the rollout responds meaningfully to changes in the action sequence $a_{1:H}$ or the initial condition $z_0$. A model that produces the same trajectory regardless of the action taken is useless for planning, even if each individual frame is perceptually realistic. This directly tests **controllability** (Section `\ref{subsec:l2_failure_modes}`{=latex}, failure mode 3) [@wu2024ivideogpt; @brooks2024sora].

The core protocol is *counterfactual divergence testing*: from the same $z_0$, execute two action sequences $a_{1:H}$ and $a'_{1:H}$ that differ at a single step, and measure whether the resulting trajectories $\tau$ and $\tau'$ diverge in a task-relevant way. Two complementary metrics capture this: the *action sensitivity ratio* (fraction of action perturbations that produce a detectable outcome change) and the *counterfactual outcome divergence* (magnitude of difference in task-relevant variables between $\tau$ and $\tau'$, normalized by the action change magnitude). In OSWorld [@xie2024osworld], this corresponds to injecting pop-up interruptions or network failures and testing whether the agent replans rather than clicking blindly. In RoboCasa, it corresponds to perturbing object placement and verifying that the manipulation strategy adapts. In Sotopia, it corresponds to changing one agent's opening move and checking whether the negotiation outcome shifts accordingly.
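The two metrics can be sketched for a scalar-state model as follows. The rollout interface, the `outcome` function that extracts a task-relevant variable, the detection tolerance, and both toy models are assumptions introduced for illustration; real benchmarks would substitute their own state and outcome definitions.

```python
from typing import Callable, List, Sequence, Tuple

Rollout = Callable[[float, Sequence[float]], List[float]]
Perturbation = Tuple[int, float]  # (step index, replacement action)


def _perturb(actions: Sequence[float], step: int, new_action: float) -> List[float]:
    alt = list(actions)
    alt[step] = new_action
    return alt


def action_sensitivity_ratio(rollout: Rollout, z0: float, actions: Sequence[float],
                             perturbations: Sequence[Perturbation],
                             outcome: Callable[[List[float]], float],
                             tol: float = 1e-6) -> float:
    """Fraction of single-step action changes that produce a detectable outcome change."""
    base = outcome(rollout(z0, actions))
    changed = sum(
        1 for step, new_a in perturbations
        if abs(outcome(rollout(z0, _perturb(actions, step, new_a))) - base) > tol
    )
    return changed / len(perturbations)


def counterfactual_outcome_divergence(rollout: Rollout, z0: float, actions: Sequence[float],
                                      perturbations: Sequence[Perturbation],
                                      outcome: Callable[[List[float]], float]) -> float:
    """Mean outcome change normalized by the magnitude of the action change."""
    base = outcome(rollout(z0, actions))
    ratios = []
    for step, new_a in perturbations:
        delta_a = abs(new_a - actions[step])
        if delta_a == 0.0:
            continue  # not a real perturbation
        delta_y = abs(outcome(rollout(z0, _perturb(actions, step, new_a))) - base)
        ratios.append(delta_y / delta_a)
    if not ratios:
        raise ValueError("no non-trivial perturbations supplied")
    return sum(ratios) / len(ratios)


def responsive_model(z0: float, actions: Sequence[float]) -> List[float]:
    """Toy model whose state integrates the actions."""
    traj = [z0]
    for a in actions:
        traj.append(traj[-1] + a)
    return traj


def action_blind_model(z0: float, actions: Sequence[float]) -> List[float]:
    """Toy model that continues the scene identically whatever the actions are."""
    return [z0 + t for t in range(len(actions) + 1)]


def final_state(traj: List[float]) -> float:
    return traj[-1]


base_actions = [1.0, 1.0, 1.0]
perturbations = [(0, 2.0), (2, 0.0)]
for model in (responsive_model, action_blind_model):
    asr = action_sensitivity_ratio(model, 0.0, base_actions, perturbations, final_state)
    cod = counterfactual_outcome_divergence(model, 0.0, base_actions, perturbations, final_state)
    print(model.__name__, asr, cod)
```

Both toy models reach the same final state under the unperturbed actions, so an output-quality metric alone cannot separate them; only the perturbation test does.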

In current evaluation practice, intervention sensitivity receives markedly less attention than the other two boundary conditions. Most benchmarks measure output quality (success rate, perceptual fidelity) but do not explicitly test whether model predictions change appropriately with actions. Closing this gap requires evaluation protocols that explicitly vary actions and measure outcome divergence beyond output quality.

Constraint Consistency
----------------------

Constraint consistency asks whether rollouts respect the governing laws $c(\tau)$ of the target regime (Section `\ref{subsec:l2_requirements}`{=latex}, condition 3). Because $c(\tau)$ depends on the entire trajectory, violations are often invisible to per-step metrics but catastrophic for planning. This condition is also the primary surface for **exploitability** (failure mode 4) [@xie2024osworld; @zheng2023mcu] and **calibration failure under distribution shift** (failure mode 5).

Verification methods vary with the governing law. For physics, the key signals are penetration depth, energy conservation violation, and support-relation consistency; VBench [@huang2023vbench] decomposes video generation quality into fine-grained physical compliance checks, while BuilderBench [@ghugare2025builderbench] tests structural stability under physical load. Software environments foreground receipt match rate, type-constraint satisfaction, and API contract adherence, with OSWorld and macOSWorld [@yang2025macosworld] providing loggable receipt streams. Social simulation relies on norm violation detection rate, commitment consistency, and Theory of Mind accuracy, assessed through Sotopia's seven-dimensional framework [@zhou2024sotopia] and ExploreToM's adversarial probing [@sclar2024exploretom]. Scientific domains use conservation law satisfaction, causal graph consistency, and evidence-chain validity; DiscoveryBench [@majumder2024discoverybench] and FutureX [@zeng2025futurex] test evidence grounding. Mobile and cross-platform environments (AppAgent [@zhang2025appagent], AndroidWorld [@rawles2024androidworld]) extend digital-world constraint testing, while AgentBench [@liu2023agentbench] provides cross-domain breadth. ChemCrow [@bran2024augmenting] evaluates chemical synthesis under strict validity constraints where a single violation renders the entire plan invalid.
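To illustrate why trajectory-level constraints can escape per-step checks, the sketch below takes a hypothetical closed system whose total energy should be conserved and compares a per-step drift test with a whole-trajectory test. The energy trace, the function names, and both tolerances are assumptions chosen for illustration.

```python
from typing import Dict, Sequence


def max_step_drift(energy: Sequence[float]) -> float:
    """Largest single-step energy change, relative to the initial energy."""
    return max(abs(b - a) for a, b in zip(energy, energy[1:])) / abs(energy[0])


def trajectory_drift(energy: Sequence[float]) -> float:
    """Largest deviation from the initial energy anywhere in the rollout."""
    return max(abs(e - energy[0]) for e in energy) / abs(energy[0])


def check_energy_conservation(energy: Sequence[float],
                              step_tol: float = 0.02,
                              traj_tol: float = 0.10) -> Dict[str, bool]:
    """Report per-step and trajectory-level compliance separately."""
    return {
        "per_step_ok": max_step_drift(energy) <= step_tol,
        "trajectory_ok": trajectory_drift(energy) <= traj_tol,
    }


# Hypothetical rollout that gains 1% energy at every step for 20 steps.
energy = [100.0 * 1.01 ** t for t in range(21)]
report = check_energy_conservation(energy)
print(report)
```

Each step stays inside the 2% per-step tolerance, yet the rollout ends more than 20% above its initial energy, so only the trajectory-level check flags the violation.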

World-Model-Specific Evaluation
-------------------------------

WorldSimBench [@qin2025worldsimbench] introduces a dual evaluation (perceptual quality and manipulative capability) for video generation models used as world simulators, revealing that perceptual realism and action-conditioned fidelity can diverge sharply. WorldModelBench [@fan2025worldmodelbench] spans 7 domains, 56 subdomains, and 350 prompts with 67K human labels, finding widespread constraint consistency gaps across all 14 frontier models tested. @vafa2024evaluating apply the Myhill-Nerode theorem to show that passing per-step tests does not guarantee long-horizon coherence: models can maintain highly incoherent internal world states while performing well on standard diagnostics. @kang2025howfar find that video generation models perform \`\`case-based" mimicry rather than learning generalizable physical principles, even at scale.

Capability Coverage Matrix
--------------------------

No single benchmark covers all three boundary conditions and four governing-law regimes; researchers should explicitly map their chosen evaluation suites to these axes to avoid over-claiming generalization.

Minimal Reproducible Evaluation Package (MREP)
----------------------------------------------

The lack of standardization in agent evaluation leads to non-comparable results and \`\`leaderboard hacking" [@henderson2018deep]. We propose the MREP as a community standard:

1.  **Version Locking:** Exact commit hashes for the environment and task set definition.

2.  **Trace Logs:** Full logs of intermediate steps, including observations, actions, and receipts, sufficient to replay and attribute failures post hoc.

3.  **Failure Taxonomy:** Automated classification aligned with our five L2 failure categories.

4.  **Tail Statistics:** Stratified bootstrap confidence intervals, IQM, and performance profiles rather than point estimates [@agarwal2021statistical].

5.  **Boundary Condition Mapping:** Explicit declaration of the boundary conditions tested.

Several MREP components already have tooling support (trace logging via LangSmith/W&B Weave; version locking via containerization; tail statistics per @agarwal2021statistical). The components requiring new infrastructure are automated failure taxonomy and boundary condition mapping. The MREP framework connects directly to L3 governed validation (Section `\ref{subsec:l3_definition}`{=latex}): the evaluation assets MREP requires are precisely what an L3 gatekeeper needs to decide whether a model update should be promoted or rolled back.
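As a rough illustration of the tail-statistics component, the following standard-library sketch computes the interquartile mean (IQM) and a percentile confidence interval from a bootstrap that resamples runs within each task. It is a simplified stand-in, not the reference implementation accompanying @agarwal2021statistical; the data layout (`scores_by_task`), the default resample count, and the seed are assumptions.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def iqm(scores: Sequence[float]) -> float:
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    s = sorted(scores)
    k = int(0.25 * len(s))  # number of scores trimmed from each end
    return mean(s[k:len(s) - k])


def stratified_bootstrap_ci(scores_by_task: Dict[str, List[float]],
                            stat: Callable[[Sequence[float]], float] = iqm,
                            n_boot: int = 2000,
                            alpha: float = 0.05,
                            seed: int = 0) -> Tuple[float, float]:
    """Percentile CI; runs are resampled with replacement inside each task (stratum)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        pooled: List[float] = []
        for runs in scores_by_task.values():
            pooled.extend(rng.choices(runs, k=len(runs)))
        stats.append(stat(pooled))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[max(int((1 - alpha / 2) * n_boot) - 1, 0)]
    return lo, hi


# Hypothetical normalized scores: four runs on each of two tasks.
scores = {"task_a": [0.1, 0.2, 0.3, 0.4], "task_b": [0.5, 0.6, 0.7, 0.8]}
pooled_scores = [x for runs in scores.values() for x in runs]
print(iqm(pooled_scores), stratified_bootstrap_ci(scores))
```

Reporting the interval alongside the point estimate shows how much of an apparent gap between two systems could be explained by run-to-run variation.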

Implementation Details and Efficient Deployment {#app:impl_extended}
===============================================

This appendix provides extended details on the practical implementation considerations and efficiency techniques for world-model systems that were summarized in Section `\ref{sec:implementation}`{=latex} of the main text.

Practical Implementation Considerations
---------------------------------------

#### Training paradigms: end-to-end vs. modular.

Two dominant strategies exist. In *end-to-end* training, the encoder, dynamics model, decoder, and policy are optimized jointly through a shared objective such as the ELBO or a value-aware loss. The Dreamer family [@hafner2019dreamer; @hafner2020dreamerv2; @hafner2023dreamerv3] exemplifies this: all components share gradients, and the policy trains entirely on imagined latent rollouts. In *modular* training, each component is trained with its own objective: representation via self-supervised learning [@assran2023ijepa; @bardes2024vjepa], dynamics via maximum likelihood or TD objectives [@hansen2024tdmpc2], and policy via model-free RL or planning. Modular development frameworks such as StarVLA [@starvla2025] provide composable codebases for systematic ablation and recombination. End-to-end training avoids cross-module error propagation but can be unstable to optimize; modular systems are easier to debug but risk mismatched interfaces.

#### Latency, compute, and the O(1) argument.

The choice of dynamics model is tightly constrained by the latency budget. Robotics control loops typically require sub-100 ms inference, favoring lightweight latent dynamics with short-horizon rollouts [@hansen2024tdmpc2; @chua2018pets]. Web and OS agents tolerate seconds-scale latency, allowing richer search and receipt parsing. High-fidelity generative models [@brooks2024sora] may take minutes per rollout, restricting them to offline planning or data augmentation.

A fundamental computational argument underlies the appeal of learned world models: a neural forward pass runs in $O(1)$ time with respect to the complexity of the simulated system (for example, the number $N$ of interacting bodies or state variables), whereas explicit simulation scales as $O(N)$ or worse. This $O(1)$ property makes learned world models viable where analytic simulation is intractable. However, the constant factor matters: the $O(1)$ forward pass of a large diffusion model may still be slower than the $O(N)$ pass of a lightweight physics engine for modest $N$.
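
The constant-factor caveat can be made concrete with a break-even calculation. The sketch below is illustrative only: it assumes the simulator cost is exactly linear in $N$, and the function names and millisecond figures are hypothetical, not measurements from any system cited here.

```python
def break_even_size(learned_cost_ms: float, sim_cost_per_element_ms: float) -> float:
    """System size N at which a linear-cost simulator matches a fixed-cost learned model.

    Assumes simulator cost = sim_cost_per_element_ms * N (an idealization).
    """
    if learned_cost_ms <= 0 or sim_cost_per_element_ms <= 0:
        raise ValueError("costs must be positive")
    return learned_cost_ms / sim_cost_per_element_ms


def faster_option(n: int, learned_cost_ms: float, sim_cost_per_element_ms: float) -> str:
    """Return which option is cheaper per step for a system of size n."""
    sim_cost_ms = sim_cost_per_element_ms * n
    return "learned" if learned_cost_ms < sim_cost_ms else "simulator"


# Hypothetical numbers: a 500 ms generative forward pass vs. 0.5 ms per simulated body.
n_star = break_even_size(learned_cost_ms=500.0, sim_cost_per_element_ms=0.5)
small = faster_option(100, 500.0, 0.5)
large = faster_option(10_000, 500.0, 0.5)
```

Under these assumed costs the learned model only pays off once the system exceeds the break-even size; below it, the explicit simulator remains the cheaper choice despite its worse asymptotic scaling.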

World-model scaling laws also differ from LLM scaling laws. Language models must memorize vast factual knowledge; world models primarily need to capture transition structure. Because a world model filters and organizes observations rather than storing facts, it may occupy a different compute-optimality regime, one in which architectural inductive biases substitute for raw parameter count more effectively than in language modeling.

#### Sim-to-real transfer.

For embodied systems, the gap between the training simulator and deployment is a persistent bottleneck. Domain randomization [@tobin2017domain] remains widely used. Complementary strategies include system identification, progressive transfer, and hybrid approaches combining learned residual dynamics with analytic physics models [@nagabandi2018mbmf]. DayDreamer [@wu2023daydreamer] demonstrated that Dreamer-style latent imagination can transfer to physical robots by training on real sensor data collected online, sidestepping the sim-to-real gap at the cost of slower data collection.
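
As a minimal sketch of the domain-randomization idea, the snippet below resamples simulator parameters once per training episode so that a model never sees a single fixed simulator. The parameter names and ranges are hypothetical placeholders chosen for illustration (randomizing dynamics rather than visual properties is itself an assumption here); they are not taken from the cited work.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    friction: float
    mass_kg: float
    latency_ms: float


# Hypothetical randomization ranges: (low, high) for each simulator parameter.
RANGES = {
    "friction": (0.5, 1.2),
    "mass_kg": (0.8, 1.5),
    "latency_ms": (0.0, 40.0),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one simulator configuration uniformly from the ranges above."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)  # fixed seed so the sampled curriculum is reproducible
episode_params = [sample_params(rng) for _ in range(1000)]
```

Each entry of `episode_params` would configure the simulator for one episode; widening the ranges trades training difficulty for robustness to unmodeled real-world variation.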

#### Error handling and graceful degradation.

A mature world-model system must degrade gracefully when predictions become unreliable. MBPO [@janner2019mbpo] limits rollout length to the accurate regime, falling back to real data for longer horizons. Ensemble disagreement [@chua2018pets] provides a practical signal for triggering replanning or escalation. In software environments, receipt parsing serves an analogous role: unexpected error codes signal that assumptions may be violated [@xie2024osworld].
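
One way to combine these two ideas is to truncate an imagined rollout as soon as ensemble members disagree beyond a threshold. The sketch below is a simplified illustration with scalar states, not the procedure of MBPO or PETS; the toy ensemble, the threshold value, and the function name are assumptions.

```python
from statistics import pstdev
from typing import Callable, List, Sequence

Model = Callable[[float, float], float]  # (state, action) -> predicted next state


def rollout_with_disagreement_cutoff(
    models: Sequence[Model],
    state: float,
    actions: Sequence[float],
    threshold: float,
) -> List[float]:
    """Roll out the ensemble mean, stopping once member predictions disagree too much."""
    trajectory = [state]
    for action in actions:
        preds = [model(state, action) for model in models]
        if pstdev(preds) > threshold:
            break  # caller should replan or fall back to real data from here
        state = sum(preds) / len(preds)
        trajectory.append(state)
    return trajectory


def make_model(gain: float) -> Model:
    """Toy ensemble member: linear dynamics with a member-specific gain."""
    return lambda s, a: gain * s + a


ensemble = [make_model(g) for g in (0.9, 1.0, 1.1)]
trajectory = rollout_with_disagreement_cutoff(
    ensemble, state=1.0, actions=[1.0] * 5, threshold=0.2
)
```

In this toy setting disagreement grows with the state magnitude, so the rollout is cut after two imagined steps even though five actions were supplied. The length of the returned trajectory tells the caller how far the model could be trusted.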

Few-Step Distillation for Generative Dynamics
---------------------------------------------

Generative simulation systems driven by diffusion and flow-matching frameworks are inherently constrained by iterative denoising latency. Initial efforts relied on training-free, high-order ODE solvers [@dockhorn2022genie; @karras2022elucidating; @lu2022dpm; @lu2025dpm; @sabour2024align; @zhang2022fast; @zheng2023dpm], but these fall short of the low-step budgets demanded by real-time agents.

The field has pivoted toward few-step distillation. Early strategies focused on compressing teacher model trajectories by matching long-stride transitions [@lipman2022flow; @salimans2022progressive]. This paradigm evolved into Consistency Models [@geng2024consistency; @lu2024simplifying; @song2023improved; @song2023consistency], which circumvent iterative sampling by learning a direct mapping from noise to clean data along the probability-flow ODE (PF-ODE). Flow-map models further generalize this concept [@boffi2024flow; @frans2024one; @heek2024multistep; @kim2023consistency; @wang2024phased]. Large-scale pre-training initiatives like TiM [@wang2025transition] and MeanFlow [@geng2025mean] have advanced this approach. Distribution-matching distillation [@salimans2024multistep; @sauer2024adversarial; @sauer2024fast; @yin2024improved; @zhou2024score; @yu2025self] has emerged as an alternative, aligning student output with teacher target distributions.

As world models increasingly rely on video generation, the higher sampling costs of video have catalyzed adaptation of acceleration techniques to the spatiotemporal domain [@ding2025dollar; @lin2025diffusion; @zhang2024sf; @zheng2025large; @nie2026transition; @yang2025longlive]. These efficiency gains transform generative models from offline simulators into viable engines for real-time planning.

Model Compression: Quantization and Pruning
-------------------------------------------

Quantization maps high-precision weights into lower-bit formats (e.g., INT8/FP8, INT4/FP4), reducing memory footprint and memory-bandwidth pressure. Pruning removes redundant parameters or structural blocks. For world modeling, the primary challenge is mitigating compounding errors: even minor quantization noise or pruning-induced degradation can lead to severe semantic drift over long horizons.
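
The compounding effect can be seen in a toy setting. The sketch below applies symmetric per-tensor INT8 quantization to a hypothetical $2 \times 2$ linear transition matrix and tracks how far the quantized rollout drifts from the full-precision one. The matrix values and horizon are assumptions chosen for illustration; real world models are nonlinear and far larger.

```python
import math
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization: integer codes plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def matvec(a: List[float], s: List[float]) -> List[float]:
    """Apply a row-major flattened 2x2 matrix to a 2-vector."""
    return [a[0] * s[0] + a[1] * s[1], a[2] * s[0] + a[3] * s[1]]


def rollout_errors(a: List[float], a_hat: List[float], s0: List[float], steps: int) -> List[float]:
    """Euclidean gap between full-precision and quantized rollouts at each step."""
    s, s_hat, errors = list(s0), list(s0), []
    for _ in range(steps):
        s, s_hat = matvec(a, s), matvec(a_hat, s_hat)
        errors.append(math.dist(s, s_hat))
    return errors


# Hypothetical near-rotation dynamics; the small off-diagonal terms quantize coarsely.
A = [0.999, 0.05, -0.05, 0.999]
codes, scale = quantize_int8(A)
A_hat = dequantize(codes, scale)
errors = rollout_errors(A, A_hat, s0=[1.0, 0.0], steps=100)
```

Comparing `errors[0]` with `errors[-1]` shows a one-step discrepancy that is small in isolation growing substantially over the 100-step horizon, which is why per-step accuracy metrics alone can understate the cost of compression for rollouts.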

Foundational post-training quantization (PTQ) techniques [@frantar2023gptq; @lin2024awq; @huang2026mcsharp; @huang2024billm; @dettmers2022gpt3int8; @huang2024mixture] were designed for LLMs but are applicable to both LLM transformers and diffusion transformers (DiTs). SqueezeLLM [@kim2024squeezellm] combines sensitivity-based non-uniform quantization with dense-and-sparse decomposition for ultra-low precision. On the serving side, efficient memory management is equally critical: vLLM [@kwon2023vllm] introduced PagedAttention, which manages KV cache memory in non-contiguous blocks analogous to OS virtual memory, dramatically improving throughput for large-model inference. For diffusion models, quantization-aware training (QAT) methods like Q-DM [@li2024q] and TerDiT [@lu2024terdit] maintain performance at 1--2 bit precision but require substantial training overhead. PTQ approaches for UNet-based diffusion models include Q-Diffusion [@li2023q], PTQ4DM [@shang2023post], and EfficientDM [@he2023efficientdm]. For transformer backbones, Q-DiT [@chen2025q], PTQ4DiT [@wu2024ptq4dit], SVDQuant [@li2024svdquant], and ViDiT-Q [@zhao2024vidit] account for the unique activation distributions of diffusion transformers through attention-aware calibration.

Network pruning includes unstructured approaches [@dong2017learning; @park2020lookahead; @sanh2020movement; @lee2019signal] and structured pruning [@ding2019centripetal; @liu2021group]. Token merging [@bolya2023token; @bolya2022token] provides training-free alternatives. For world models, rollout-aware pruning is a promising frontier: pruning criteria must preserve the parameters that are critical for long-horizon coherence, aligning with the L2 boundary condition of temporal consistency defined in Section `\ref{subsec:l2_requirements}`{=latex}.

Memory and KV Cache Compression
-------------------------------

Autoregressive token dynamics are severely memory-bound during long-horizon rollouts because the KV cache grows linearly with the number of generated tokens. Key compression strategies include:

1.  **Token eviction:** heavy-hitter retention [@zhang2023h2o] and attention-sink preservation [@xiao2024streamingllm] discard low-salience entries to bound cache size.

2.  **Chunk-level autoregressive generation:** modern video models generate in chunks [@yin2025slow; @huang2025selfforcing; @feng2025streamdiffusionv2], though hardware constraints often limit output to ${\sim}60$ seconds.

3.  **KV quantization:** schemes such as KIVI [@liu2024kivi], KVQuant [@hooper2024kvquant], QuaRot [@ashkboos2024quarot], and RotateKV [@su2025rotatekv] are mature for LLM serving, but porting them to video diffusion causes severe quality loss due to different activation statistics.

4.  **Spatiotemporal-aware compression:** effective video KV compression requires frameworks that explicitly leverage video-specific spatiotemporal redundancy [@yang2025sparse].
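
To make the linear growth and the token-eviction strategy concrete, the sketch below estimates KV cache size and applies a simplified sink-plus-recent-window eviction policy in the spirit of attention-sink preservation. The function names and the model dimensions are hypothetical, and the policy is an illustration rather than the implementation of any cited system.

```python
from typing import List, Sequence


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes assumes FP16/BF16 storage
) -> int:
    """Cache size for one sequence; the leading 2 counts keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def evict_sink_and_window(positions: Sequence[int], num_sink: int, window: int) -> List[int]:
    """Keep the first `num_sink` positions plus the most recent `window` positions."""
    if len(positions) <= num_sink + window:
        return list(positions)
    return list(positions[:num_sink]) + list(positions[len(positions) - window:])


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
bytes_1k = kv_cache_bytes(32, 8, 128, seq_len=1_000)
bytes_2k = kv_cache_bytes(32, 8, 128, seq_len=2_000)
kept = evict_sink_and_window(list(range(10)), num_sink=2, window=3)
```

Doubling the rollout length doubles the unbounded cache, whereas the eviction policy caps retained entries at `num_sink + window` regardless of horizon. The price is that evicted context is no longer attendable, which is why quality depends on which entries are kept.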

[^1]: <https://www.moltbook.com/>