---
abstract: |
  World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the *right latent space* becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel *reconstruction*, recent work suggests benefits from pretrained encoders with representation-aligned *semantic* latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by comparing six reconstruction and semantic encoders used to train world model variants under a fixed protocol on the BridgeV2 dataset, and show effective world model training in high-dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show that visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent spaces as a stronger foundation for policy-relevant robotics diffusion world models.
author:
- |
  Nilaksh$^{\textbf{*}1,2,3}$ Saurav Jha$^{\textbf{*}1,2,3}$ Artem Zholus$^{\textbf{*}1,2,3}$ Sarath Chandar$^{1,2,3,4}$\
  $^{1}$Chandar Research Lab $^{2}$Mila -- Quebec AI Institute $^{3}$Polytechnique Montréal $^{4}$Canada CIFAR AI Chair\
  $^*$Equal Contribution\
  Correspondence: `[nilaksh.nilaksh, saurav.jha]@mila.quebec`\
  <https://hskalin.github.io/semantic-wm/>\
  <https://huggingface.co/Nilaksh404/semantic-wm>
bibliography:
- neurips\_2026.bib
title: 'Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models'
---

Introduction
============

Action-conditioned video world models are emerging as a practical interface between generative modeling and robotics [@ha2018world; @yang2023learning; @brooks2024video]. Given observation and action histories, they predict future observations and serve as learned proxies for robot-environment interaction when handcrafted simulators are difficult to build [@todorov2012mujoco; @erez2015simulation]. Recent works show that such models can support policy evaluation with good correlation to real-world outcomes [@tseng2025scalable], and policy improvement [@zhu2025wmpo; @zhang2025reinforcing; @sharma2026world]. Yet current evaluations say little about which representation makes a world model faithful to robotic dynamics.

[\[fig:pipeline-overview\]]{#fig:pipeline-overview label="fig:pipeline-overview"}

This question is increasingly important because many video world models are latent diffusion models (LDMs) [@vahdat2021score; @rombach2022high] that learn dynamics in an encoder-defined latent space. The standard choice is a reconstruction-aligned autoencoder, such as a VAE [@kingma2013auto] or recent variants [@esser2024scaling; @yao2025vavae; @agarwal2025cosmos], whose latents are optimized for pixel fidelity and stable decoding. But robotic world models are more than video generators: planning and evaluation require predictions that preserve physical, spatial, and task dynamics. This motivates using the semantic spaces of self-supervised and vision-language encoders as latents for robot world modeling [@caron2021emerging; @oquab2023dinov2; @he2020momentum; @he2022masked; @midovjepa2; @radford2021learning; @tschannen2025siglip]. These spaces expose object layout and task structure more directly than pixel-trained autoencoders [@shi2026latent]. However, they are harder to use for diffusion: their higher dimensionality yields off-manifold latent generations with poor object structure [@zhang2025both]. RAE [@zheng2025diffusion] makes them more tractable with a dimension-dependent noise-schedule shift and a wide DDT head [@wang2025ddt], while S-VAE [@zhang2025both] learns a compact, KL-regularized latent space using an autoencoder as an adapter over the frozen semantic features.

Still, the effect of semantic latents on action-conditioned LDMs for robotics remains an open question. DINO-WM [@zhouDINOWMWorldModels2025] and V-JEPA 2-AC [@midovjepa2] show that pretrained feature spaces support planning, but they are not diffusion models: DINO-WM is an autoregressive feature-prediction world model, while V-JEPA 2-AC is a JEPA predictor [@assran2023self]. RAE-NWM [@zhang2026rae] shows that DINOv2 [@oquab2023dinov2] spaces support diffusion-based navigation world modeling. Yet navigation differs from contact-rich manipulation, where gripper motion, object state, geometry, and policy rollouts all matter. This leads to our question: **what effect does the choice of latent space have on LDM-based robotic world modeling?**

We answer this with a controlled evaluation study that varies only the representation space in which the transition model operates (see Fig. [\[fig:pipeline-overview\]](#fig:pipeline-overview){reference-type="ref" reference="fig:pipeline-overview"}). For effective semantic space LDM training, we adapt RAE's wide-head and schedule-shift recipe [@zheng2025diffusion] alongside the compact S-VAE adapter [@zhang2025both], and train on the Bridge V2 dataset [@pmlr-v229-walke23a] with the same DiT transition model [@peebles2023scalable] and action-conditioning scheme. We then propose an evaluation suite spanning three axes: visual fidelity, planning and downstream policy performance, and latent quality. Our findings show that semantic latents improve action recoverability, task-success classification, CEM planning, and policy-in-the-loop success, while reconstruction latents mainly retain photometric advantages. Our key contributions are three-fold:

1.  Our primary contribution is the *evaluation* of representation spaces for latent diffusion world modeling. We conduct controlled analyses of how the choice of latent space affects not only visual generation, but also robotic tasks and robustness, through our three proposed evaluation axes.

2.  We propose an effective recipe for *training diffusion world models in high-dimensional semantic spaces* by leveraging recent advances in semantic-space diffusion and extending them to action-conditioned world modeling. We also study the effects of different design choices.

3.  We show that semantic latent spaces are consistently more useful for policy evaluation and planning, even when reconstruction latents match or exceed them on low-level pixel fidelity, establishing that the best robotic world model latent space is the one that preserves action-relevant structure, not merely the one that reconstructs images the best.

Problem Formulation {#sec:formulation}
===================

We consider multi-task robot manipulation from partial observations. The offline dataset is $\mathcal{D}=\{(o_{0:T},\,a_{0:T-1},\,\ell,\,y)\}$, where $o_t \in \mathcal{O}$ is an RGB observation, $a_t \in \mathbb{R}^{d_a}$ is a continuous robot action, $\ell$ is an optional language instruction, and $y \in \{0,1\}$ denotes episode success. Tasks vary in object configurations and instructions, but share a robot embodiment; we therefore view the data as samples from related partially observed Markov Decision Processes with shared dynamics and task-dependent goals. Because a single observation does not generally determine the next observation under an action, we condition on a finite visual-action history of length $H$ and model the action-conditioned predictive distribution over a rollout horizon $K$: $p(o_{t+1:t+K} \mid o_{t-H:t},\, a_{t-H:t+K-1})$.
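
To make the conditioning concrete, the sketch below shows how an episode can be sliced into (history, action, future) windows that match this predictive distribution; the helper name and storage layout are illustrative assumptions, not our actual data loader.

```python
def make_windows(obs, actions, H=2, K=8):
    """Slice one episode into (history, action, future) training windows.

    obs:     array-like of observations o_0 .. o_T (length T+1)
    actions: array-like of actions a_0 .. a_{T-1} (length T)
    Yields windows matching p(o_{t+1:t+K} | o_{t-H:t}, a_{t-H:t+K-1}).
    """
    T = len(actions)
    for t in range(H, T - K + 1):
        yield (
            obs[t - H : t + 1],      # o_{t-H:t}, the visual history (H+1 frames)
            actions[t - H : t + K],  # a_{t-H:t+K-1}, conditioning actions
            obs[t + 1 : t + K + 1],  # o_{t+1:t+K}, the prediction target
        )
```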

![image](figures/results/action_subspace_idm_vae_vjepa_warm_cool_32_46_25.png){width="30%"}

[\[fig:action-traj\]]{#fig:action-traj label="fig:action-traj"}

Latent Space World Models {#sec:latent_wm}
-------------------------

Rather than predicting future frames directly in pixel space, latent world models learn predictive dynamics in a representation space. Each model consists of a frozen encoder, an optional frozen adapter, an action-conditioned transition model, and a decoder.

#### Encoder and adapter.

A pretrained image encoder maps each observation to a spatial latent $z_t=f_\phi(o_t)\in\mathbb{R}^{N\times D}$, where $N=h\times w$ is the number of patches and $D$ is the encoder's native channel dimension. The encoder is frozen, so $f_\phi$ fixes the representation space in which dynamics are learned. For high-dimensional semantic representation encoders, we optionally use a frozen adapter $\alpha_\psi$ to obtain compact diffusion-friendly latents $\tilde{z}_t=\alpha_\psi(z_t)\in\mathbb{R}^{N\times d}$ [@zhang2025both]. For compressed reconstruction-aligned latent spaces, the adapter is the identity map.
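
Concretely, the encode path reduces to the following minimal PyTorch-style sketch; the function and its signature are our illustration, not the released code.

```python
import torch

@torch.no_grad()
def encode(obs, encoder, adapter=None):
    """Map a frame batch to frozen latents z_t (and optionally compact latents).

    obs:     (B, 3, 256, 256) image batch
    encoder: frozen f_phi returning (B, N, D) patch tokens, N = h*w
    adapter: frozen S-VAE-style alpha_psi mapping (B, N, D) -> (B, N, d),
             or None (identity) for compressed reconstruction latents
    """
    z = encoder(obs)  # (B, N, D); no gradients ever flow into the encoder
    return adapter(z) if adapter is not None else z
```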

#### Transition model.

An action-conditioned DiT [@peebles2023scalable] predicts future latent trajectories: $\tilde{z}_{t+1:t+K}\sim p_\theta(\cdot\mid \tilde{z}_{t-H:t},a_{t-H:t+K-1})$. Only the transition model is updated during world model training; the encoder, adapter, and decoder remain fixed. For semantic encoders without adapters, we add a lightweight wide DDT head [@wang2025ddt], which adds only a few parameters while addressing the width bottleneck of DiT for high-dimensional latent spaces [@zheng2025diffusion]. Otherwise, variants share the same transition backbone and differ only in representation and decoding path. Table [\[tab:arch-param-compute\]](#tab:arch-param-compute){reference-type="ref" reference="tab:arch-param-compute"} (Appx. [10](#app:archi-and-training){reference-type="ref" reference="app:archi-and-training"}) shows that the DiT backbone with adapter does not incur an increase in parameter count or GFLOPs. Compute parity is explained in Appx. [9](#app:faq){reference-type="ref" reference="app:faq"}.

#### Decoder.

Predicted latents are mapped back to pixels as $\hat{o}_{t+1:t+K}=\mathrm{Dec}(\tilde{z}_{t+1:t+K})$. The decoder is needed for visual rollouts and pixel-level evaluation, but decoded image quality alone does not determine world model quality: a model may render plausible frames while missing action-relevant dynamics, or preserve control-relevant structure despite minor photometric errors.

The Role of the Latent Space in Robotics {#sec:latent_spaces}
----------------------------------------

The encoder-defined latent space determines the state representation on which the transition model $p_\theta$ learns dynamics. In LDM, reconstruction-aligned latents $z_t^{\mathrm{pix}}=f_\phi^{\textsc{pix}}(o_t)\in\mathbb{R}^{N\times D_{\mathrm{pix}}}$ are commonly used because they preserve pixel-level information and provide reliable decoders [@child2021very]. For robotic world models, however, the relevant state is not only what an image looks like, but how it changes under actions and whether those changes preserve task progress, object state, contact, and geometry. This creates a multi-objective problem where useful latents should be action-controllable, task-informative, visually decodable, and useful for planning or policy evaluation.

As an initial diagnostic, we use an inverse dynamics model (IDM) to probe whether an encoder makes action-relevant change explicit in latent space (see Appx. [12.4](#app:latent-rep){reference-type="ref" reference="app:latent-rep"} for details). Figure [\[fig:action-traj\]](#fig:action-traj){reference-type="ref" reference="fig:action-traj"} shows that different encoders induce markedly different action-aligned trajectory geometries, suggesting that encoder choice changes which aspects of robot dynamics are easy for a transition model to learn. This motivates us to treat the latent space $f_\phi$ as the experimental variable, and evaluate its effect beyond visual fidelity and on axes spanning controllability, task semantics, and policy performance.
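
As a minimal illustration of such a probe (the MLP architecture and mean-pooling over patch tokens are our assumptions; the actual probe follows Appx. [12.4](#app:latent-rep){reference-type="ref" reference="app:latent-rep"}), one can regress actions from consecutive latents and report the Pearson correlation used throughout:

```python
import torch
import torch.nn as nn

class IDMProbe(nn.Module):
    """Predict the action a_t from a pair of consecutive latents (z_t, z_{t+1})."""

    def __init__(self, latent_dim, action_dim=7, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t, z_next):
        # Pool patch tokens to one vector per frame before concatenation.
        x = torch.cat([z_t.mean(dim=1), z_next.mean(dim=1)], dim=-1)
        return self.net(x)

def pearson_r(pred, target):
    """Per-dimension Pearson correlation, averaged over action dimensions."""
    p = pred - pred.mean(dim=0)
    t = target - target.mean(dim=0)
    r = (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + 1e-8)
    return r.mean()
```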

We thus compare reconstruction-aligned latents with semantic latents from pretrained vision foundation models [@oquab2023dinov2; @midovjepa2; @tschannen2025siglip], denoted as $z_t^{\mathrm{rep}}=f_\phi^{\textsc{rep}}(o_t)\in\mathbb{R}^{N\times D_{\mathrm{rep}}}$. Since $D_{\mathrm{rep}}$ is typically large, we evaluate both native features and compact adapter latents $\tilde{z}_t=\alpha_\psi(z_t^{\mathrm{rep}})$. We train one world model per candidate in $\Phi=\{f_\phi^{(1)},\ldots,f_\phi^{(m)}\}$ while fixing the data, history, action conditioning, optimizer, and transition backbone, so that each model learns a different latent transition $p_{\theta}^{(\phi)}(\tilde{z}_{t+1:t+K}\mid \tilde{z}_{t-H:t},a_{t-H:t+K-1})$. The decoder differences are controlled through reconstruction gap metrics, latent-space metrics, and planning metrics.

Experiments
===========

Dataset and Training
--------------------

#### Benchmark protocol.

We isolate the effect of the encoder-defined latent space by fixing the dataset, history length, action conditioning, transition architecture, optimizer, and training schedule, and varying only the encoder $f_\phi$, optional adapter $\alpha_\psi$, and decoder path. For each encoder--adapter pair, we train an LDM from scratch and evaluate the resulting world model for visual fidelity, representation quality, and downstream policy performance (see Appx. [10](#app:archi-and-training){reference-type="ref" reference="app:archi-and-training"}).

#### Dataset.

We train and evaluate on Bridge V2 [@pmlr-v229-walke23a], a real-robot manipulation dataset with $\approx$60K WidowX 250 demonstrations across 13 task families. Each episode includes RGB observations, 7 Degrees-of-Freedom (DoF) end-effector actions covering position, rotation, and gripper state, and a language instruction. For trajectory success classification, we use SOAR [@zhou2024autonomous], which contains roughly 30.5K success/failure-labeled episodes for the WidowX 250 with a 1:2 class split.

#### Encoder variants.

We compare two encoder families. Reconstruction-aligned encoders $f_\phi^{\textsc{pix}}$ include Stable Diffusion 3 (SD3) VAE [@esser2024scaling] with $D{=}16$, VA-VAE [@yao2025vavae] with $D{=}32$, and Cosmos [@agarwal2025cosmos] with $D{=}16$; for these, $\alpha_\psi \equiv \mathbb{I}$. Semantics-aligned encoders $f_\phi^{\textsc{rep}}$ include V-JEPA 2.1 [@MurLabadia2026VJEPA2U] with $D{=}1024$; Web-DINO [@fan2025scaling], adapted from DINOv2 [@oquab2023dinov2], with $D{=}1024$; and SigLIP 2 [@tschannen2025siglip] with $D{=}1152$. For semantic encoders, we evaluate both native latents and compact latents from a pretrained S-VAE adapter [@zhang2025both], which maps $D{\to}d$ with $d{=}96$.

#### Adapter, decoder, and transition model.

The S-VAE adapter [@zhang2025both] is pretrained to reconstruct frozen encoder features with a KL-regularized loss, and is paired with a lightweight pixel decoder. All transition models are DiTs trained on Bridge V2 [@pmlr-v229-walke23a] with flow matching [@lipman2023flow]. Each DiT layer factorizes attention into a spatial block within each frame and a causal temporal block across frames. We sample every second frame, condition on $H{=}2$ history frames, and predict 8 future frames. We do not use language-instruction conditioning when training the DiT. For all non-VAE encoders, we apply a dimension-dependent noise-schedule shift [@esser2024scaling]. At inference, models roll out autoregressively one frame at a time using a 10-frame sliding context; VAE variants use their native pixel decoders, while semantic variants use the learned adapter decoder (see Appx. [10](#app:archi-and-training){reference-type="ref" reference="app:archi-and-training"} for details).
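
For reference, autoregressive inference with a sliding context can be sketched as follows; `wm.sample` is an assumed one-frame denoising interface, not our actual API.

```python
import torch

@torch.no_grad()
def rollout(wm, z_history, actions, n_future, context=10):
    """Autoregressive latent rollout with a sliding context window (sketch).

    wm:        trained transition model; wm.sample(z_ctx, a_ctx) is an
               assumed interface that denoises the next latent frame
    z_history: (B, H, N, d) latents of the conditioning frames
    actions:   (B, H + n_future, d_a) actions aligned with the full rollout
    """
    frames = [z_history[:, i] for i in range(z_history.shape[1])]
    for _ in range(n_future):
        t = len(frames)
        z_ctx = torch.stack(frames[-context:], dim=1)  # last <=10 latent frames
        a_ctx = actions[:, max(0, t - context):t]      # actions for that context
        frames.append(wm.sample(z_ctx, a_ctx))         # predict one more frame
    return torch.stack(frames[z_history.shape[1]:], dim=1)  # predicted latents
```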

Evaluation Metrics {#sec:eval-metrics}
------------------

To study how the choice of latent representation propagates to downstream tasks, we propose an evaluation suite that separates this effect across three axes. See Appx. [11](#app:eval_metrics){reference-type="ref" reference="app:eval_metrics"} for details.

1.  **Planning and downstream policy performance.** For robotics applications, a latent world model should enable planning, *i.e.*, searching for the optimal action sequence given a goal state [@zhouDINOWMWorldModels2025; @midovjepa2]. Evaluating planning helps separate latent world modeling performance from pixel decoder performance, which visual metrics conflate. Given a real $k$-step transition, we use the cross-entropy method (CEM) [@rubinstein2004cross] to recover the action sequence whose predicted latent best matches the target, and report CEM error at single-step $(k=1)$ and multi-step $(k=4)$ horizons (a minimal sketch of this search appears after this list).

    We also test whether the world model can serve as a policy-evaluation environment. We roll out OpenVLA-7B [@kim2025openvla] inside each world model on 20 Bridge V2 test episodes with 8 trials per episode; a subset of 10 of these episodes is used for Out-Of-Distribution (OOD) evaluations. We use two Vision-Language Models (VLMs), InternVL 3.5 [@wang2025internvl3] and Qwen 3.6 [@qwen3.6-27b], to judge task success. We report consensus success rate, Borda rank, and robustness under distractor-object and OOD-instruction perturbations. See Appx. [11](#app:eval_metrics){reference-type="ref" reference="app:eval_metrics"} for metric definitions, Appx. [9](#app:faq){reference-type="ref" reference="app:faq"} regarding fairness of VLM ratings, and Appx. [11.4](#app:vla-eval-ood){reference-type="ref" reference="app:vla-eval-ood"} & [11.5](#app:vlm-judge){reference-type="ref" reference="app:vlm-judge"} for exact details about OOD frame and OOD instruction generation, as well as details about tasks.

2.  **Pixel fidelity and scene geometry.** Decoded rollouts must remain visually coherent to support visual policies. We report image/video metrics: FID, SSIM, LPIPS, FVD, temporal LPIPS, and point-track consistency, together with perceptual and geometric scores from WorldArena [@shang2026worldarena]. This family measures generation and motion quality, temporal consistency, and scene geometry.

3.  **Latent representation quality.** Because the transition model operates in latent space, we directly probe whether generated latents preserve action- and task-relevant structure. We train an inverse dynamics model (IDM) [@tian2025predictive] on frozen encoder latents to recover action chunks for horizon $k{\in}\{1,4\}$, and apply the IDM to world model latents to measure generation-induced degradation. We train a classifier on latent trajectories of SOAR [@zhou2024autonomous], a dataset of trajectories annotated with language and success labels, to classify whether a trajectory was a success given the text instruction. We again measure the degradation in accuracy induced by evaluating on generated latents.
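
Below is the promised sketch of the CEM action recovery used in the first axis; the population and elite sizes and the `wm(z_ctx, a)` rollout interface are illustrative assumptions.

```python
import torch

def cem_recover_actions(wm, z_ctx, z_target, k, d_a=7,
                        pop=64, elites=8, iters=10):
    """Recover a k-step action sequence whose predicted latent matches z_target.

    wm(z_ctx, a) is assumed to roll the world model k steps under action
    sequence a of shape (k, d_a) and return the final predicted latent.
    """
    mu = torch.zeros(k, d_a)
    sigma = torch.ones(k, d_a)
    for _ in range(iters):
        # Sample a population of candidate action sequences.
        cand = mu + sigma * torch.randn(pop, k, d_a)
        # Score candidates by latent-space distance to the target.
        pred = torch.stack([wm(z_ctx, a) for a in cand])
        scores = (pred - z_target).flatten(1).norm(dim=1)
        elite = cand[scores.topk(elites, largest=False).indices]
        # Refit the sampling distribution to the elite set.
        mu, sigma = elite.mean(dim=0), elite.std(dim=0) + 1e-4
    return mu  # best estimate of the action sequence
```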

![Latent space utility](figures/results/encoder_tradeoff_dit_s_cem_warm_cool.png){#fig:dits-idm-metrics width="\\textwidth"}

![Visual utility](figures/results/pixel_tradeoff_ssim_fvd_flow_dit_s_warm_cool.png){#fig:dits-vis-metrics width="\\textwidth"}

![Policy performance](figures/results/vla_ood_tradeoff_vla_sr_ood_dist_ood_inst_dit_s_warm_cool.png){#fig:dits-policy-metrics width="\\textwidth"}

Findings
========

Does the choice of latent space affect planning and policy performance? {#sec:policy-perf}
-----------------------------------------------------------------------

#### Semantic latents offer better policy-in-the-loop performance.

Table [\[tab:policy-perf\]](#tab:policy-perf){reference-type="ref" reference="tab:policy-perf"} shows that encoder choice strongly affects downstream VLA policy rollouts at DiT-S. Reconstruction-aligned spaces perform worst: VAE and VA-VAE have the lowest consensus success rates and weakest Borda ranks, while semantic encoders improve policy success, interaction quality, and robustness. V-JEPA 2.1 and SigLIP 2 variants give the strongest DiT-S results. Semantic-family VLA SR and CEM scores outperform the reconstruction family under a paired bootstrap over tasks, as shown in our analysis in Appx. [12.3](#app:stat-analysis){reference-type="ref" reference="app:stat-analysis"}.

[\[tab:policy-perf\]]{#tab:policy-perf label="tab:policy-perf"}

#### Native semantic spaces preserve action geometry for planning.

Representation-aligned spaces have the lowest action-recovery errors across all DiT backbone sizes (Table [\[tab:policy-perf\]](#tab:policy-perf){reference-type="ref" reference="tab:policy-perf"}, and Table [\[tab:policy-full\]](#tab:policy-full){reference-type="ref" reference="tab:policy-full"} in Appx. [12](#app:additional-res){reference-type="ref" reference="app:additional-res"}). For example, at DiT-S, V-JEPA 2.1 is best at $k{=}4$ and SigLIP 2 is best at $k{=}1$. Fig. [3](#fig:dits-policy-metrics){reference-type="ref" reference="fig:dits-policy-metrics"} likewise shows semantic encoders closer to the upper-right diagonal in the VLA--OOD plane, while VAE-family models fall lower and suffer larger distractor-induced drops.

#### Scaling narrows policy gaps but not action-centric gaps.

Appx. Table [\[tab:policy-full\]](#tab:policy-full){reference-type="ref" reference="tab:policy-full"} shows that for DiT-L, the gaps in VLA success and OOD robustness for VAE and Cosmos narrow relative to semantic encoders. We attribute this to improved visual fidelity at larger model size, which benefits the VLA policy. However, both still lag on CEM action recovery, which depends directly on latent transition structure rather than rendered visual quality; at DiT-L, VAE and Cosmos have larger $k{=}1$ CEM errors than all semantic encoders. They also lag on IDM $r$ and classifier accuracy (Table [\[tab:idm-pearson-k1-k4\]](#tab:idm-pearson-k1-k4){reference-type="ref" reference="tab:idm-pearson-k1-k4"} and [\[tab:success-probe-full\]](#tab:success-probe-full){reference-type="ref" reference="tab:success-probe-full"}).

Does the latent space affect action recoverability and preservation of task semantics?
--------------------------------------------------------------------------------------


[\[tab:idm-suc-dits\]]{#tab:idm-suc-dits label="tab:idm-suc-dits"}

#### Semantic latents make action-relevant changes more recoverable.

Table [\[tab:idm-suc-dits\]](#tab:idm-suc-dits){reference-type="ref" reference="tab:idm-suc-dits"} shows that semantic encoders retain substantially more action information than reconstruction-aligned ones. On encoder latents, V-JEPA 2.1 and Web-DINO achieve the strongest IDM Pearson $r$ across both horizons, and this advantage largely persists after world model (WM) generation. The trends also hold with DiT scaling (Tables [\[tab:idm-pearson-k1-k4\]](#tab:idm-pearson-k1-k4){reference-type="ref" reference="tab:idm-pearson-k1-k4"} and [\[tab:success-probe-full\]](#tab:success-probe-full){reference-type="ref" reference="tab:success-probe-full"} in Appx. [12.4](#app:latent-rep){reference-type="ref" reference="app:latent-rep"}).

#### Semantic latents better preserve task-success information.

From Table [\[tab:idm-suc-dits\]](#tab:idm-suc-dits){reference-type="ref" reference="tab:idm-suc-dits"}, we also see that success classifiers trained on frozen encoder latents achieve higher accuracy for semantic encoders, and their performance degrades less when evaluated on generated WM latents, with SigLIP 2 achieving the best WM-latent accuracy. This indicates that semantic spaces not only encode local action effects, but also retain higher-level task progress signals useful for policy evaluation.

How does the latent space affect visual fidelity?
-------------------------------------------------

![image](figures/results/G_fig1c_noadpt_trajectory_full_main_ssim_only_warm_cool.png){width="32%"}

[\[fig:rollout-gap-trend\]]{#fig:rollout-gap-trend label="fig:rollout-gap-trend"}

#### Semantic latent spaces remain visually competitive.

Table [\[tab:visual-quality\]](#tab:visual-quality){reference-type="ref" reference="tab:visual-quality"} shows that the policy gains from semantic encoders do not come at the cost of decoded visual quality. At DiT-S scale, these encoders dominate most perceptual, structural, and video-level metrics, particularly when used with adapters $d_{96}$: SigLIP 2$_{96}$ gives the best SSIM, V-JEPA 2.1$_{96}$ gives the best FVD, and Web-DINO variants are strongest on JEPA similarity, subject consistency, depth error, and temporal LPIPS.

VAE-style spaces remain competitive on image quality and qualitatively tend to preserve sharper local appearance details, but they lag behind semantic spaces on global structure and temporal generation quality. Figures [\[fig:rollout-gap-trend\]](#fig:rollout-gap-trend){reference-type="ref" reference="fig:rollout-gap-trend"} and [6](#fig:rollout-gap-full){reference-type="ref" reference="fig:rollout-gap-full"} (Appx. [12](#app:additional-res){reference-type="ref" reference="app:additional-res"}) show that semantic-space models have a smaller reconstruction-generation gap, particularly when extrapolating beyond the 10-frame horizon seen during training.

#### Large DiTs help recover much of the visual advantage of reconstruction latents.

Increasing transition model capacity benefits reconstruction latents the most. For DiT-L, VAE becomes highly competitive, achieving the best FID, image quality, aesthetic quality, JEPA similarity, depth error, dynamic degree, and FVD, while also ranking second on LPIPS and flow score. Here, semantic encoders still remain strong: V-JEPA 2.1$_{96}$ gives the best SSIM and LPIPS, and SigLIP 2$_{96}$ remains competitive on structure and temporal metrics, but their gains from scaling are less uniform. Overall, visual fidelity alone does not explain the downstream policy advantages observed in Sec. [4.1](#sec:policy-perf){reference-type="ref" reference="sec:policy-perf"}.

Does scaling along input views and model size help?
---------------------------------------------------

![image](figures/results/multi_view_tradeoff_noadpt_warm_cool.png){width="\\linewidth"}

![image](figures/results/H5_fig_ssim_grouped_by_size_warm_cool.png){width="\\linewidth"}

[\[fig:scaling\]]{#fig:scaling label="fig:scaling"}

#### Multi-view training improves action recovery but can hurt video quality under limited data.

We take the trained DiT-S models and finetune them for 20 epochs on the BridgeV2 episodes that contain three camera views. Fig. [\[fig:scaling\]](#fig:scaling){reference-type="ref" reference="fig:scaling"} (left) shows that while this does lead to superior CEM action prediction, it also degrades generation quality, possibly due to the smaller number of training episodes. However, the semantic encoders are more robust to this degradation. **Model scaling improves both visual quality and policy success, with larger gains for reconstruction latents:** in Fig. [\[fig:scaling\]](#fig:scaling){reference-type="ref" reference="fig:scaling"} (right), we see that both generation (SSIM) and policy performance (VLA-SR) generally scale with the DiT size. Here, VAE scales notably well on visual metrics and approaches semantic encoders, which already perform strongly at DiT-S.

![image](figures/results/noadpt_vs_adapter_grouped_warm_cool.png){width="\\linewidth"}

[\[fig:adapt-vs-noadapt\]]{#fig:adapt-vs-noadapt label="fig:adapt-vs-noadapt"}

Do reconstruction-aligned and semantic encoders fail differently?
-----------------------------------------------------------------

#### The main failure modes differ: reconstruction latents hallucinate task semantics, while semantic latents miss geometry and contact.

Our qualitative rollouts in Appx. Fig. [10](#fig:hallucinated-rollout-pixels2){reference-type="ref" reference="fig:hallucinated-rollout-pixels2"} show that all encoder families share a common failure mode where static scene elements are faithfully preserved while manipulation-relevant details are hallucinated. Beyond this universal pattern, the encoder families show distinct hallucinations. Reconstruction encoders tend to fail at the object-semantic level: VAE and Cosmos hallucinate the white basket and the green towel, respectively, in Fig. [4](#fig:success-rate-comp){reference-type="ref" reference="fig:success-rate-comp"}, producing coherent-looking but task-incorrect states, and under OOD instructions (Appx. Fig. [14](#fig:ood-instruction-same-ep){reference-type="ref" reference="fig:ood-instruction-same-ep"}), both maintain the prior action pattern rather than updating to the new goal. Semantic encoders preserve task-level intent at the cost of geometric precision (e.g., V-JEPA 2.1 under-opens the drawer in Appx. Fig. [10](#fig:hallucinated-rollout-pixels2){reference-type="ref" reference="fig:hallucinated-rollout-pixels2"}). We find the latter to better capture semantic distinctions even under instruction shift (e.g., the fold-unfold task in Appx. Fig. [12](#fig:ood-comparison-same-ep){reference-type="ref" reference="fig:ood-comparison-same-ep"}).

Do compressed adapter latents aid semantic encoders further for world modeling?
-------------------------------------------------------------------------------

#### Adapters improve diffusion ease but can distort control geometry.

Fig. [\[fig:adapt-vs-noadapt\]](#fig:adapt-vs-noadapt){reference-type="ref" reference="fig:adapt-vs-noadapt"}, Table [\[tab:policy-perf\]](#tab:policy-perf){reference-type="ref" reference="tab:policy-perf"}, and Table [\[tab:visual-quality\]](#tab:visual-quality){reference-type="ref" reference="tab:visual-quality"} show that the compressed space $d_{96}$ of adapters helps the latent diffusion model, as also observed by @zhang2025both and @baiSemanticGenVideoGeneration2025. This leads to generally stronger performance than the native variants on most metrics, except latent CEM action error, OOD robustness, and PCK coverage. These findings suggest that the adapter compresses the latent space in a way that is useful for diffusion denoising and high-level task completion but hurtful for fine-grained tasks like trajectory optimization, where precise action information is needed.

Do high-dimensional semantic latents and adapter add computational overhead?
----------------------------------------------------------------------------

**High-dimensional semantic latents do not substantially increase DiT compute in our setup.** The DiT always receives the same number of tokens per frame $N{=}256$, hence larger channel dimensions only affect the input/output projections (see Appx. [10.2](#app_subsec:action-conditioned-diffusion){reference-type="ref" reference="app_subsec:action-conditioned-diffusion"} for discussion). The main compute differences instead come from the frozen encoder and decoder architectures. In particular, ViT-based semantic encoders paired with the adapter pixel decoder remain competitive in total GFLOPs, while native high-dimensional semantic spaces require only a lightweight wide DDT head [@wang2025ddt]. We report parameter counts and GFLOPs split by encoder, adapter, DiT, and decoder in Appx. Table  [\[tab:arch-param-compute\]](#tab:arch-param-compute){reference-type="ref" reference="tab:arch-param-compute"}.

-   **Visual fidelity does not always imply downstream performance.** Reconstruction latents can match or exceed semantic latents on pixel-level metrics, especially at larger DiT scale, yet lag on action recovery, task-success probes, CEM planning, and policy-in-the-loop evaluation.

-   **Semantic latents scale better with multiple views.** Under limited data, adding multiple views improves planning but can hurt visual rollouts; semantic encoders retain the action-recoverability benefit with substantially less degradation than reconstruction latents.

-   **Adapters trade control geometry for diffusion ease.** Adapters ease diffusion and decoding, but can distort fine-grained action geometry compared with native semantic features.

-   **World models in semantic spaces narrow the reconstruction-generation gap.** Under the same decoder-training budget, semantic world models generate closer to their reconstruction ceiling.

-   **High-dimensional semantic latents are practical in DiTs.** With a fixed patch-token count, semantic width adds little to the transition-model cost.

A Recipe for Semantic Latent Diffusion Robotics World Modeling
==============================================================

Our findings suggest a practical recipe for building robotic latent diffusion world models. Do not select the latent space by pixel reconstruction quality alone. Instead, choose a latent space that makes action-relevant structure explicit, make that space easy for diffusion to model, and evaluate the resulting world model with control- and policy-based metrics. Visual realism can often be improved through better decoder training, but transition quality and latent fidelity remain important. Use robot demonstration datasets, preferably with language instructions and, when available, success/failure labels to unlock diverse evaluations. Use semantic encoder latents as the default latent state space, since they preserve action geometry and task progress better than reconstruction latents. Add a compact adapter and pixel decoder when decoded rollout quality or VLA-in-the-loop evaluation matters. For the transition model, a robust default for high-dimensional semantic spaces is a DiT with causal temporal blocks, a shallow-wide DDT head [@wang2025ddt], and dimension-aware noise shifting [@zheng2025diffusion]. The spatial blocks stay non-causal since per-frame patches are denoised jointly. For training, diffusion forcing [@chen2024diffusion] can be used for autoregressive next-frame rollout. Finally, evaluate world models on all three axes, covering visual, latent, and downstream task performance.

[\[tab:visual-quality\]]{#tab:visual-quality label="tab:visual-quality"}

![**Open-VLA success rate comparison on two random episodes:** four frames are sampled at even intervals. [Green]{style="color: green"} and [red]{style="color: red"} show trajectories marked as success and failure by the InternVL 3.5 VLM.](figures/results/comparison_vla.png){#fig:success-rate-comp width="\\textwidth"}

Related work
============

**Robotic world models** can be seen to span three related objectives. One line treats world models as policy-evaluation environments: WorldGym [@quevedo2025worldgym] and WorldEval [@li2025worldeval] roll out policies in learned video models; [@tseng2025scalable] studies how pretraining, data diversity, and failure modes affect evaluation. A second line adapts pretrained generators into interactive simulators: UniSim [@yang2023learning] learns interactive real-world simulators from broad data; Vid2World [@huang2025vid2world] causalizes video diffusion with action guidance; Ctrl-World [@guo2025ctrl] studies multi-view, long-horizon, policy-in-the-loop manipulation. A third line moves prediction and planning into semantic feature space: DINO-WM [@zhouDINOWMWorldModels2025], DINO-world [@baldassarre2025back], and V-JEPA 2-AC [@midovjepa2] show that pretrained representations can support latent space forecasting and zero-shot or few-shot planning. These works establish the utility of both video generation and semantic representations, but do not isolate the encoder-defined latent space within a unified action-conditioned framework.

**World model evaluation** has moved beyond rollout plausibility and policy ranking toward physics, semantics, and embodied utility [@li2025evaluating; @MurLabadia2026VJEPA2U]. RBench [@li2024rbench] measures task correctness and structural realism. WorldModelBench [@li2025worldmodelbench] highlights instruction-following and physics-adherence failures missed by generic video metrics. EWMBench [@yue2025ewmbench] evaluates scene consistency, motion correctness, and semantic alignment. World-in-World [@zhang2025world] prioritizes closed-loop task success, WoW-World-Eval [@fan2026wow] adds inverse-dynamics-based action plausibility, and WorldArena [@shang2026worldarena] exposes the gap between perceptual quality and downstream functionality. These benchmarks evaluate world models at the system level, while we seek to evaluate them at the model level. See Appx. [10.1](#app:ldm-lit){reference-type="ref" reference="app:ldm-lit"} for a review of LDM.

Future Work and Limitations {#sec:future_work}
===========================

Our study isolates the effect of encoder-defined latent spaces within a controlled action-conditioned LDM protocol. The conclusions are therefore scoped to the Bridge V2 manipulation setting and a shared robot embodiment. Evaluating broader embodiments, domains, and data regimes is an important next step. Our policy-in-the-loop experiments also focus on evaluating a fixed VLA policy inside generated rollouts, while policy improvement and sim-to-real transfer would test a complementary use of the same world models. Lastly, our evaluation partially relies on VLM-based success judgments, which may introduce evaluator bias. We reduce this dependence by aggregating multiple VLMs and pairing them with non-VLM diagnostics, including CEM planning, inverse dynamics, latent success classification, and visual/geometric metrics.

Conclusion
==========

Our study shows that the encoder-defined latent space is a central design choice for action-conditioned latent diffusion world models in robotics. Across visual, latent, planning, and policy-in-the-loop evaluations, semantic representation spaces such as those of V-JEPA 2.1, Web-DINO, and SigLIP 2 generally provide stronger action recoverability, task-success classification accuracy, robustness, and downstream policy performance than reconstruction-aligned VAE-style latents, even when the latter remain competitive or superior on low-level photometric metrics. These results support the view that robotic world models should not be selected solely by visual realism, but by whether their latent dynamics preserve action-relevant structure and policy evaluation accuracy.

Nilaksh is partly supported by a grant (<https://doi.org/10.69777/2009238>) from the Fonds de recherche du Québec (FRQNT). Saurav Jha is supported by the IVADO postdoctoral fellowship and the Canada First Research Excellence Fund. Sarath Chandar is supported by the Canada CIFAR AI Chairs program, the Canada Research Chair in Lifelong Machine Learning, and the NSERC Discovery Grant. This research was enabled in part by compute resources provided by Mila ([mila.quebec](https://mila.quebec)) and the Digital Research Alliance of Canada ([alliancecan.ca](https://alliancecan.ca)).

Frequently Asked Questions (FAQs) {#app:faq}
=================================

1.  **What are the parameter counts and GFLOPs of the full diffusion pipelines for each of the encoder families? How is the parameter/compute parity ensured with adapters and wide heads?**

    We show in Table [\[tab:arch-param-compute\]](#tab:arch-param-compute){reference-type="ref" reference="tab:arch-param-compute"} in Appx. [10](#app:archi-and-training){reference-type="ref" reference="app:archi-and-training"} the summary of the parameter counts and compute required for inference of all semantic spaces, with and without adapters.

    Parameter and compute parity are ensured by keeping the same DiT backbone across all rows and giving every model the same 256 tokens per frame. For adapter-based semantic encoders, the S-VAE adapter compresses high-dimensional features to 96 channels, making the DiT almost identical to the VAE-latent case. For native semantic latents, only the shallow input/output projection or wide head changes, so the *extra parameters do not increase DiT depth and add little compute*. Thus, the comparison is not driven by a larger diffusion model; it isolates the effect of using richer semantic representation spaces, which remain competitive in compute while providing stronger task-relevant structure.

2.  **How sensitive are the policy-in-the-loop results to the choice of VLM judges? Are inter-judge agreements available?**

    The policy-in-the-loop results do show sensitivity to the VLM judge, particularly on harder tasks: agreement is high on simple Level 1 tasks, while Level 2--4 tasks involve finer spatial, contact, deformable-object, and stacking judgments that naturally induce more judge variation; see Table [\[tab:vla-per-instruction-ditl\]](#tab:vla-per-instruction-ditl){reference-type="ref" reference="tab:vla-per-instruction-ditl"} for detailed results. We therefore rate trajectories with three VLMs and select the two most correlated judges, InternVL3.5-14B and Qwen3.6-27B, based on inter-judge Cohen's $\kappa$ agreement (Fig. [5](#fig:vlm-kappa){reference-type="ref" reference="fig:vlm-kappa"}). To further reduce single-judge dependence, Table [\[tab:policy-perf\]](#tab:policy-perf){reference-type="ref" reference="tab:policy-perf"} reports both consensus success rates with variance and Borda ranks, which are less sensitive to absolute score calibration. Finally, our conclusions do not rely only on VLM ratings: we also report task-instruction-conditioned success-classifier metrics on generated latents in Table [\[tab:idm-suc-dits\]](#tab:idm-suc-dits){reference-type="ref" reference="tab:idm-suc-dits"}, providing an independent task-conditioned signal that supports the same trends.
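
    For completeness, the inter-judge agreement statistic used above is standard Cohen's $\kappa$ for binary verdicts, which can be computed as in this minimal sketch:

    ```python
    import numpy as np

    def cohens_kappa(a, b):
        """Cohen's kappa between two judges' binary success labels.

        a, b: arrays of 0/1 verdicts for the same set of rollouts.
        """
        a, b = np.asarray(a), np.asarray(b)
        p_o = (a == b).mean()                      # observed agreement
        p_yes = a.mean() * b.mean()                # chance agreement on "success"
        p_no = (1 - a.mean()) * (1 - b.mean())     # chance agreement on "failure"
        p_e = p_yes + p_no
        return (p_o - p_e) / (1 - p_e + 1e-12)
    ```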

3.  **Why was CEM chosen for latent space planning instead of gradient based planners or differentiable MPC?**

    We use CEM because latent-space planning involves non-convex objectives and noisy gradients. As a derivative-free optimizer, CEM is robust to black-box dynamics and compounding errors [@rubinstein2004cross]. Its stochastic search avoids local minima better than gradient-based or differentiable Model Predictive Control (MPC), motivating its use in PlaNet [@hafner2019learning] and CEM-MPC [@pinneri2021sample]. While gradient planners are faster, they are sensitive to model inaccuracies and gradient instability [@bharadhwaj2020model]. Consequently, CEM provides a conservative, reliable baseline for evaluating world-model quality.

4.  **Is there evaluation on another manipulation dataset or embodiment (e.g., ALOHA, Franka) to test generalization? What are the expected transfer and potential pitfalls?**

    Evaluation on additional embodiments is an important direction but outside the scope of this study, whose controlled comparison is centered on BridgeV2; we did, however, use SOAR data for training the success classifier. We expect the main conclusion, that semantic latents are more policy-relevant than purely reconstruction-aligned latents, to transfer most directly when object-centric semantics and action-conditioned contact dynamics remain comparable. Cross-embodiment evaluation on ALOHA-style bimanual manipulation, Franka setups, or broader simulators such as RoboCasa [@robocasa2024; @robocasa365] would introduce new challenges: different camera viewpoints, action spaces, gripper morphology, control frequencies, embodiment-specific failure modes, and sim-to-real gaps. These factors may require embodiment-specific action tokenization, calibration, or classifier re-training, making such benchmarks an excellent test of whether semantic latent world models generalize beyond a single robot-data distribution. We also mention this as a potential future work avenue in Section [7](#sec:future_work){reference-type="ref" reference="sec:future_work"}.

5.  **What is the benefit of diffusion models over non-diffusion world models that use semantic features for manipulation like DINO-WM and V-JEPA 2 AC?**

    DINO-WM and V-JEPA 2-AC provide compelling evidence that pretrained semantic features are useful for robotic prediction and planning, and we view them as complementary to our study rather than direct competitors. Our central research question is specifically how the choice of latent space affects *diffusion-based* action-conditioned world modeling, so comparing against non-diffusion architectures would conflate representation choice with model-family differences. Diffusion models are also a natural testbed for this question because they model a distribution over future sequences and can denoise an entire prediction horizon jointly, which may better capture multimodal futures and reduce the compounding errors associated with purely autoregressive one-step regression rollouts, although this is a mitigation rather than a guarantee. Thus, our experiments are intentionally scoped to isolate the effect of semantic versus reconstruction latents within a fixed LDM framework; broader comparisons to non-diffusion semantic world models are important future work.

6.  **How were the learning rates and other hyperparameters chosen for different encoder latent spaces?**

    We used the same optimizer and learning-rate recipe for all world models, rather than tuning separately for each latent space. Specifically, all DiTs were trained with AdamW, learning rate $10^{-4}$, betas $(0.9,0.99)$, weight decay $2\times 10^{-3}$, gradient clipping, EMA, linear warmup, and cosine decay. Our goal is to isolate the effect of the encoder-defined latent space, and per-encoder hyperparameter tuning would confound the comparison by giving different latent spaces different optimization budgets. For each model-size group, runs were trained under the same schedule and until losses had plateaued. Since each DiT-S run costs roughly 6--7 hours on 4 H100s, each DiT-L run about 34 hours, and adapter/pixel-decoder training about 55 hours, exhaustive sweeps over learning rate, weight decay, warmup, batch size, EMA, and noise schedule for every encoder would be prohibitively expensive. We therefore use a fixed standard recipe and report all models under the same optimization protocol.

Architecture and Training Details {#app:archi-and-training}
=================================

[\[tab:arch-param-compute\]]{#tab:arch-param-compute label="tab:arch-param-compute"}

Latent Diffusion Modeling (LDM) {#app:ldm-lit}
-------------------------------

LDM learns to denoise in compact reconstruction-aligned autoencoder spaces such as that of VAEs [@kingma2013auto]. Recent VAE variants include: Stable Diffusion 3 [@esser2024scaling] adapting autoencoding to rectified flow models, VA-VAE [@yao2025vavae] aligning autoencoders with vision foundation models, and Cosmos [@agarwal2025cosmos] providing tokenizers across flexible compression regimes. In parallel, semantic-aligned encoders (DINOv2 [@oquab2023dinov2], SigLIP [@tschannen2025siglip], Qwen-VL [@Qwen2.5-VL; @Qwen3-VL], V-JEPA 2.1 [@MurLabadia2026VJEPA2U]) provide structured visual features, but their high dimensionality can make generative modeling unstable  [@skorokhodov2025improving; @yu2024image]. Representation autoencoders (RAEs) address this by pairing frozen pretrained encoders with learned decoders [@zheng2025diffusion; @tong2026scaling], enabling semantic latent spaces that support both visual understanding and generation [@tong2026beyond]. However, high-dimensional RAE features can still suffer from off-manifold sampling and weak fine-geometry reconstruction [@zhang2025both], suggesting that RAEs do not simply replace VAEs but instead expose a tradeoff between pixel faithfulness and semantic abstraction [@zhang2026rae]. For robotics, this tradeoff implies that the best latent space is not necessarily the one that reconstructs frames most faithfully, but the one that preserves action-relevant dynamics for prediction, planning, and policy evaluation.

Action-Conditioned Diffusion Model {#app_subsec:action-conditioned-diffusion}
----------------------------------

The world model is trained in the latent space of a frozen visual encoder. Let $o_{0:T-1}$ be a video clip, $a_{0:T-1}$ the corresponding action sequence, and $f_{\phi}$ the frozen encoder. We first form latents $$z_{0:T-1} = f_{\phi}(o_{0:T-1}), \qquad
    z_t \in \mathbb{R}^{N \times D},$$ where $N=h \times w$ is the number of spatial tokens and $D$ is the native encoder channel dimension. In the code, tensors are stored as $h \times w \times D$, but the notation below flattens space to $N$ tokens. For adapter-based semantic encoders, $z_t$ is further compressed by the adapter $\alpha_{\psi}$ before being passed to the diffusion model, $$\tilde z_t = \alpha_{\psi}(z_t), \qquad
    \tilde z_t \in \mathbb{R}^{N \times d}, \qquad d=96.$$ The adapter and encoder are frozen during world model training; only the DiT parameters are optimized.
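
In code, this encoding path amounts to a single frozen forward pass per frame. Below is a minimal sketch, assuming generic `f_phi` and `alpha_psi` callables; the helper name and tensor layout are illustrative rather than our released implementation.

```python
import torch

@torch.no_grad()  # encoder and adapter stay frozen during world model training
def encode_clip(f_phi, alpha_psi, frames):
    """frames: (T, 3, 256, 256) video clip -> per-frame latent tokens."""
    z = f_phi(frames)           # (T, N, D) native encoder tokens, N = h * w
    if alpha_psi is not None:   # adapter-based semantic encoders only
        z = alpha_psi(z)        # (T, N, d) compact latents with d = 96
    return z
```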

[\[tab:dit-size-presets\]]{#tab:dit-size-presets label="tab:dit-size-presets"}

All DiT runs in Table [\[tab:arch-param-compute\]](#tab:arch-param-compute){reference-type="ref" reference="tab:arch-param-compute"} use a DiT-L backbone with 24 layers, hidden size 1024, 16 attention heads, and $T{=}10$ frames. The context length is $H{=}2$, so the model conditions on $\tilde z_{0:H-1}$ and predicts the future block $\tilde z_{H:T-1}$ under actions $a_{0:T-1}$ and optional language $\ell$. The VAE latent has shape $32{\times}32{\times}16$ and is patchified with DiT patch size $p{=}2$, while all semantic, Cosmos, and VA-VAE latents use a $16{\times}16$ token grid with $p{=}1$. Thus every row gives the DiT the same number of tokens per frame: $$N = (h/p)(w/p) = 16 \cdot 16 = 256 .$$ **This is the main reason high-dimensional semantic latents do not substantially increase DiT compute:** the transformer blocks operate on the same token count and hidden width, and the latent channel dimension only appears in the input patch projection and output prediction layer.
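
This token-count equality can be checked directly:

```python
def tokens_per_frame(h, w, p):
    """Number of DiT tokens per frame after patchification."""
    return (h // p) * (w // p)

# VAE: 32x32 latent grid, patch size 2; semantic/Cosmos/VA-VAE: 16x16 grid, patch size 1.
assert tokens_per_frame(32, 32, 2) == tokens_per_frame(16, 16, 1) == 256
```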

#### Shallow-wide DDT head.

For high-dimensional representation latents, we also use a lightweight *shallow-wide* DDT head [@wang2025ddt]. The DiT backbone remains unchanged: the head uses a 2048-dimensional readout width and keeps a minimal spatial refinement stage before the final patch prediction layer. This adds local spatial processing capacity at the output, so the shallow head can improve the mapping from backbone features to the high-dimensional representation with a minimal increase in parameters.
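
A minimal sketch of such a head is given below. Only the 2048-dimensional readout width comes from our setup; the depthwise-convolutional refinement stage and the output dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ShallowWideHead(nn.Module):
    """Wide readout with a minimal spatial refinement stage before the
    final patch prediction layer."""
    def __init__(self, hidden=1024, width=2048, out_dim=1024):
        super().__init__()
        self.up = nn.Linear(hidden, width)                           # wide readout
        self.refine = nn.Conv2d(width, width, 3, padding=1, groups=width)
        self.out = nn.Linear(width, out_dim)                         # patch prediction

    def forward(self, x, h=16, w=16):   # x: (B, N, hidden) with N = h * w
        b = x.shape[0]
        x = self.up(x)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = x + torch.relu(self.refine(x))  # local spatial refinement, residual
        x = x.flatten(2).transpose(1, 2)    # back to (B, N, width)
        return self.out(x)
```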

#### World model training hyperparameters.

All world models are trained on Bridge V2 clips resized to $256{\times}256$, with $T{=}10$ frames, $H{=}2$ history frames, frame skip 2, and 7-dimensional actions. Unless otherwise stated, the reported single-view runs use distributed data-parallel training on 4 H100 GPUs, per-GPU batch size 16 for DiT-S and 5 for DiT-L, bfloat16 autocast, and `torch.compile` [@ansel2024pytorch]. The optimizer is AdamW [@loshchilov2018decoupled] with learning rate (LR) of $10^{-4}$, betas $(0.9,0.99)$, weight decay $2{\times}10^{-3}$, $\epsilon{=}10^{-8}$, and gradient clipping at global norm 1.0. We maintain an EMA copy of the DiT weights with decay 0.9995. The LR schedule is a linear warmup followed by cosine decay to $0.7$ of the base LR. All runs use 3 LR warmup epochs and 100 total epochs.
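
The shared recipe can be sketched as follows; the `dit` stand-in module and the epoch-granularity `LambdaLR` wrapper are assumptions for illustration, while the hyperparameter values are those listed above.

```python
import math
import torch

dit = torch.nn.Linear(8, 8)  # stand-in for the DiT backbone

opt = torch.optim.AdamW(dit.parameters(), lr=1e-4, betas=(0.9, 0.99),
                        weight_decay=2e-3, eps=1e-8)

def lr_lambda(epoch, warmup=3, total=100, floor=0.7):
    """Linear warmup, then cosine decay to 0.7x the base LR."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, total - warmup)
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Each step also clips gradients to global norm 1.0 and updates an EMA copy
# of the DiT weights with decay 0.9995.
```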

#### Flow matching.

The model is trained with the optimal-transport flow-matching objective [@lipman2023flow]. For future frames $i \in \{H,\dots,T-1\}$, we sample $\tau_i \sim p(\tau)$, draw $\epsilon \sim \mathcal{N}(0,I)$, and linearly interpolate between data and noise: $$\tilde z_{\tau_i,i} = (1-\tau_i)\tilde z_i + \tau_i \epsilon_i .$$ The DiT predicts the velocity field $v_{\theta}(\tilde z_{\tau}, \tau, a_{0:T-1}, \ell)$, and the target velocity is $$u_i = \epsilon_i - \tilde z_i .$$ With clean history context, the training loss is $$\mathcal{L}_{\mathrm{FM}}
    =
    \mathbb{E}_{\tilde z,\epsilon,\tau}
    \left[
    \sum_{i=H}^{T-1}
    \left\|
    v_{\theta}(\tilde z_{\tau,i}, \tau_i, a_{0:T-1}, \ell) -
    (\epsilon_i - \tilde z_i)
    \right\|_2^2
    \right].$$ We only apply this loss to future frames. History frames are used as conditioning context with no diffusion noise $(\tau=0)$. However, during training they receive a small Gaussian augmentation, $$\tilde z^{\mathrm{ctx}}_{\mathrm{aug}}
    =
    \frac{\tilde z^{\mathrm{ctx}} + \sigma_h \eta}
         {\sqrt{1+\sigma_h^2}},
    \qquad
    \eta \sim \mathcal{N}(0,I),$$ which prevents the model from overfitting to perfectly clean context latents.

#### Dimension-dependent noise schedule shift.

For non-VAE latents, the timestep distribution is shifted as a function of the latent dimensionality seen by the DiT. Following @esser2024scaling and @zheng2025diffusion, we use the shift: $$\gamma = \sqrt{\frac{(256/p^2)d}{4096}}, \qquad
    \tau' = \frac{\gamma \tau}{1 + (\gamma - 1)\tau}.$$ Here $d$ is the DiT input channel count after any adapter. This makes the noise level depend on the latent representation size rather than only on image resolution.
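
The shift reduces to a small helper; for the $d{=}96$ adapter latents with $p{=}1$, it gives $\gamma = \sqrt{6} \approx 2.45$.

```python
import math

def shift_timestep(tau, d, p=1):
    """Dimension-dependent timestep shift; d is the DiT input channel
    count after any adapter."""
    gamma = math.sqrt((256 / p**2) * d / 4096)
    return gamma * tau / (1 + (gamma - 1) * tau)
```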

#### Inference and causal attention.

All our world models carry out autoregressive inference in latent space. Given encoded history $\tilde z_{0:H-1}$, the sampler appends a Gaussian latent for the next frame and integrates the learned velocity field backward from $\tau{=}1$ to $\tau{=}0$ with 10 Euler steps [@lipman2023flow; @esser2024scaling]: $$\tilde z_{\tau_{j+1},t} =
    \tilde z_{\tau_j,t} - (\tau_j-\tau_{j+1})
    v_{\theta}(\tilde z_{\tau_j,0:t}, \tau_j, a_{0:t}, \ell)_t .$$ The generated frame is then appended to the context and the process repeats for the desired horizon. Our temporal attention blocks are causal: each spatial token attends only to its own past states, following the causal video-transformer design used by VDT [@lu2024vdt].
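
A sketch of this sampler is given below; the Euler integration and the 10-step schedule follow the text, while the conditioning interface of `v_theta` is an assumption.

```python
import torch

@torch.no_grad()
def rollout(v_theta, ctx, actions, lang, horizon, steps=10):
    """ctx: (B, H, N, d) encoded history; returns history plus generated frames."""
    taus = torch.linspace(1.0, 0.0, steps + 1)
    for _ in range(horizon):
        z = torch.randn_like(ctx[:, -1:])             # Gaussian init at tau = 1
        for j in range(steps):                        # integrate tau: 1 -> 0
            v = v_theta(torch.cat([ctx, z], 1), taus[j], actions, lang)[:, -1:]
            z = z - (taus[j] - taus[j + 1]) * v       # Euler step
        ctx = torch.cat([ctx, z], 1)                  # append generated frame
    return ctx
```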

Adapter
-------

High-dimensional semantic encoders produce per-patch features $z \in \mathbb{R}^{N \times D}$ that are prohibitively expensive for the diffusion model to operate on directly. We pair them with an S-VAE adapter [@zhang2025both] that compresses $z$ to a compact latent $\tilde{z} \in \mathbb{R}^{N \times d}$ ($d \ll D$). The adapter $\alpha_\psi$ comprises a Transformer encoder $g_{\psi}^{\mathrm{enc}}$, a per-token diagonal-Gaussian bottleneck, and a Transformer decoder $g_{\psi}^{\mathrm{dec}}$: $$\begin{aligned}
    h &= g_{\psi}^{\mathrm{enc}}(z), \\
    (\mu,\log\sigma^2) &= W_{\mu,\sigma^2}\, h, \\
    \tilde{z} &= \mu + \sigma \odot \xi, \qquad \xi \sim \mathcal{N}(0,I), \\
    \hat{z} &= g_{\psi}^{\mathrm{dec}}(\tilde{z}).\end{aligned}$$ Both $g_{\psi}^{\mathrm{enc}}$ and $g_{\psi}^{\mathrm{dec}}$ consist of 3 Transformer blocks at dimension $D$, each followed by LayerNorm. The encoder appends a linear head $D \to 2d$ and the decoder prepends a linear head $d \to D$. We default to 12 attention heads and an FFN width of 3072.
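
A compact sketch of the adapter follows; PyTorch's stock `TransformerEncoder` stands in for the actual blocks, and the default $D{=}1152$ is chosen only so that 12-head attention divides the width evenly.

```python
import torch
import torch.nn as nn

def blocks(D, heads=12, ffn=3072, depth=3):
    layer = nn.TransformerEncoderLayer(D, heads, ffn, batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, depth)

class SVAEAdapter(nn.Module):
    """Hedged sketch of the S-VAE adapter described above."""
    def __init__(self, D=1152, d=96):
        super().__init__()
        self.enc, self.dec = blocks(D), blocks(D)
        self.to_stats = nn.Linear(D, 2 * d)   # linear head D -> 2d
        self.from_lat = nn.Linear(d, D)       # linear head d -> D

    def forward(self, z):                     # z: (B, N, D) per-patch features
        mu, logvar = self.to_stats(self.enc(z)).chunk(2, -1)
        # During DiT training the bottleneck is applied deterministically (z~ = mu).
        lat = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return self.dec(self.from_lat(lat)), lat, mu, logvar
```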

The adapter training loss is: $$\begin{split}
    \mathcal{L}_{\mathrm{adapter}}
    =
    \underbrace{\mathcal{L}_{\mathrm{MSE}}(z,\hat{z})
    + \lambda_{\mathrm{cos}}\mathcal{L}_{\mathrm{cos}}(z,\hat{z})
    + \lambda_{\mathrm{spec}}\mathcal{L}_{\mathrm{FFT}}(z,\hat{z})}_{\text{semantic reconstruction}}
    + \\ \lambda_{\mathrm{KL}}\,D_{\mathrm{KL}}\!\left(
      q_\psi(\tilde{z}\mid z)\,\|\,\mathcal{N}(0,I)
    \right)
    + \lambda_{\mathrm{pix}}\,\mathcal{L}_{\mathrm{pix}}(o,\hat{o}),
    \end{split}
    \label{eq:adapter_loss}$$ where $\hat{o} = \mathrm{Dec}(\tilde{z})$ is the pixel-decoder reconstruction. $\mathcal{L}_{\mathrm{MSE}}$ and $\mathcal{L}_{\mathrm{cos}} = 1 - \cos(z,\hat{z})$ jointly enforce feature-space fidelity: MSE penalizes magnitude errors while the cosine term preserves directional (semantic) structure. $D_{\mathrm{KL}}$ regularizes the approximate posterior $q_\psi(\tilde{z}\mid z) = \mathcal{N}(\mu, \sigma^2 I)$ toward a standard Gaussian prior. $\mathcal{L}_{\mathrm{FFT}}$ is an $\ell_1$ loss on 1-D FFT magnitudes along the spatial-token axis, penalizing loss of high-frequency structure through the bottleneck. $\mathcal{L}_{\mathrm{pix}} =
  \mathcal{L}_{\mathrm{MSE}}(o,\hat{o})
  + \lambda_{\mathrm{LPIPS}}\mathcal{L}_{\mathrm{LPIPS}}
  + \lambda_{\mathrm{SSIM}}(1 - \mathrm{MS\text{-}SSIM})$ grounds the compact latent in pixel space. Following @zhang2025both, we use $\lambda_{\mathrm{spec}}{=}0.01$, $\lambda_{\mathrm{LPIPS}}{=}\lambda_{\mathrm{SSIM}}{=}0.5$. During DiT training, $\alpha_\psi$ is frozen and applied deterministically ($\tilde{z} = \mu$) as a fixed projection into the compact latent space.
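
The combined objective can be sketched as below, with the LPIPS and MS-SSIM terms of the pixel loss elided for brevity.

```python
import torch
import torch.nn.functional as F

def adapter_loss(z, z_hat, mu, logvar, o, o_hat,
                 l_cos=1.0, l_spec=0.01, l_kl=1e-4, l_pix=1.0):
    """Sketch of the adapter objective; LPIPS/MS-SSIM pixel terms elided."""
    mse = F.mse_loss(z_hat, z)
    cos = 1 - F.cosine_similarity(z_hat, z, dim=-1).mean()
    spec = (torch.fft.rfft(z_hat, dim=1).abs()
            - torch.fft.rfft(z, dim=1).abs()).abs().mean()  # 1-D FFT over tokens
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    pix = F.mse_loss(o_hat, o)                              # + LPIPS + MS-SSIM terms
    return mse + l_cos * cos + l_spec * spec + l_kl * kl + l_pix * pix
```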

#### Adapter training hyperparameters.

The encoder is frozen throughout adapter training. The adapter is trained for 200 epochs on Bridge V2 with per-GPU batch size 16 for single-view training and bfloat16 autocast. The optimizer is AdamW with betas $(0.9,0.99)$ and weight decay $10^{-4}$. The base adapter learning rate is $10^{-4}$ for the single-view run; the pixel decoder uses a 3$\times$ learning-rate multiplier when trained jointly. Multi-view adapter fine-tuning uses learning rate $5{\times}10^{-5}$ and lower per-GPU batch sizes because each sample contains three camera views. The KL coefficient is linearly warmed up for the first 20% of optimizer steps to $\lambda_{\mathrm{KL}}{=}10^{-4}$, while $\lambda_{\mathrm{cos}}{=}1$ and $\lambda_{\mathrm{pix}}{=}1$. LPIPS, when enabled, is evaluated in float32 after a 50k-sample perceptual warmup. Gradients for both the adapter and pixel decoder are clipped to norm 1.0.

Pixel Decoder
-------------

The semantic encoders use the adapter pixel decoder for reconstruction. The pixel decoder maps compact latents $\tilde z \in \mathbb{R}^{N \times 96}$ to RGB observations: $$\hat o = \mathrm{Dec}(\tilde z) = D_{\omega}^{\mathrm{pix}}(\tilde z).$$ Architecturally, it is an LDM-style convolutional decoder with two residual blocks per level, and a 4-head self-attention block at $16{\times}16$ resolution. For the S-VAE setup, the pixel decoder is trained on detached adapter latents with the pixel loss $\mathcal{L}_{\text{pix}}$. As such, the pixel loss does not backpropagate into the adapter. The pixel reconstruction loss used in adapter training is $$\mathcal{L}_{\mathrm{pix}}
    =
    \|\hat o-o\|_2^2
    + \lambda_{\mathrm{LPIPS}}\mathcal{L}_{\mathrm{LPIPS}}(\hat o,o)
    + \lambda_{\mathrm{SSIM}}\left(1-\mathrm{MS\text{-}SSIM}(\hat o,o)\right).$$ The reported experiments use this S-VAE path rather than the older PS-VAE mode. For native semantic DiTs without an adapter in the diffusion model, visualization still uses the same surrogate path: native latent $\rightarrow$ adapter encoder $\rightarrow$ pixel decoder.

Encoder-specific overhead
-------------------------

Table [\[tab:arch-param-compute\]](#tab:arch-param-compute){reference-type="ref" reference="tab:arch-param-compute"} summarizes parameter counts and compute. We split the parameter counts by frozen encoder, adapter, DiT, and decoder. GFLOPs are reported per single $256{\times}256$ frame, counting a multiply-add as two separate floating-point operations. The total compute column adds encoder, adapter projection when used, one DiT velocity evaluation, and the decoder used for visualization/reconstruction. The differences in total GFLOPs in Table [\[tab:arch-param-compute\]](#tab:arch-param-compute){reference-type="ref" reference="tab:arch-param-compute"} are therefore mostly due to the frozen encoder and decoder, and not the DiT backbone itself. The DiT sees the same $N{=}256$ tokens per frame across all models, so increasing the semantic latent channel dimension mainly changes the input/output projections. In contrast, the encoders use different network families: VAE and VA-VAE are convolutional autoencoders operating over high-resolution feature maps, V-JEPA 2.1 and Web-DINO are ViT-style patch encoders [@dosovitskiy2021an], and SigLIP 2 is a larger, higher-capacity ViT-style vision model. Decoder compute also differs substantially: VAE uses its native convolutional decoder, VA-VAE uses a lighter convolutional decoder, and the semantic encoders use the adapter pixel decoder from a compact $16{\times}16$ latent grid. Thus the native 1024--1152D semantic rows have nearly the same DiT GFLOPs as their adapter-based $d_{96}$ counterparts.

Table [\[tab:training-compute\]](#tab:training-compute){reference-type="ref" reference="tab:training-compute"} reports the measured training time and GPU configuration for the adapter/pixel-decoder stage and the DiT scaling runs.

[\[tab:training-compute\]]{#tab:training-compute label="tab:training-compute"}

Inverse Dynamics Model (IDM) {#app:idm_subsec}
----------------------------

The Inverse Dynamics Model (IDM) [@tian2025predictive] is a patch-token Transformer trained to predict an action chunk $\hat{a}_{t:t+k-1} \in \mathbb{R}^{k \times d_a}$ from a window of $k+1$ consecutive encoder latents $(z_t, z_{t+1}, \ldots, z_{t+k})$. Each $z_t = f_\phi(o_t) \in \mathbb{R}^{N \times D}$ is the spatial patch grid produced by the frozen encoder $f_\phi$ directly, *i.e.*, no adapter $\alpha_\psi$ is applied, so the IDM always operates in the native encoder channel space of dimension $D$. Each frame's $N = h \times w$ patch tokens are projected by a shared linear layer into a model-width embedding, augmented with factored temporal and spatial positional embeddings, and then flattened into a joint sequence of $(k+1) \cdot N$ tokens. A set of $k$ learned per-step class-token (CLS) readout queries is prepended to this sequence. All tokens attend jointly through $L$ pre-norm Transformer blocks with scaled dot-product self-attention [@vaswani2017attention], and the final-layer representations of the $k$ CLS positions are decoded by a two-layer MLP head to the predicted action chunk $\hat{a}_{t:t+k-1}$. Following @tian2025predictive, we train each encoder-specific IDM on real encoded trajectories from Bridge V2 with Smooth-L1 loss.

The IDM serves as a probe of action recoverability for each encoder space $f_\phi \in \Phi$. After training, it is evaluated at horizons $k \in \{1, 4\}$ using Pearson $r$ between the predicted action chunk $\hat{a}_{t:t+k-1}$ and the ground-truth $a^*_{t:t+k-1}$, averaged over the $d_a$ continuous action dimensions. Critically, the same frozen IDM head is then applied without retraining to world model-generated latent pairs $(\hat{z}_t, \hat{z}_{t+k})$ from DiT rollouts of the same episodes. The real--WM gap in Pearson $r$ thus measures generation-induced erasure of the action-discriminative geometry in the latent space, a form of degradation invisible to pixel-level metrics such as SSIM or LPIPS.
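
The probe protocol can be sketched as follows; helper names and tensor shapes are illustrative.

```python
import torch

def pearson_r(pred, true):
    """Mean Pearson r across action dimensions; pred/true: (M, d_a)."""
    p = pred - pred.mean(0)
    t = true - true.mean(0)
    return ((p * t).sum(0) / (p.norm(dim=0) * t.norm(dim=0) + 1e-8)).mean()

@torch.no_grad()
def idm_gap(idm, z_real, z_wm, a_star):
    """Score the same frozen IDM on real and world-model latent windows."""
    r_real = pearson_r(idm(z_real).flatten(0, 1), a_star.flatten(0, 1))
    r_wm = pearson_r(idm(z_wm).flatten(0, 1), a_star.flatten(0, 1))
    return r_real, r_wm, r_real - r_wm   # gap = generation-induced erasure
```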

VLA success classifier probe
----------------------------

The success classifier probe $s_\phi$ is a spatio-temporal Transformer trained on full latent trajectories $z_{0:T}$ from the SOAR dataset [@zhou2024autonomous] to classify episode success $y \in \{0,1\}$ given the language instruction $\ell$. Each trajectory's spatial latent grid is first downsampled to a $4{\times}4$ super-patch grid via adaptive average pooling, yielding $P{=}16$ spatial tokens per frame, and linearly projected to a shared model width of 384. Factored temporal and spatial positional embeddings are added in place, yielding a sequence of $T \times P$ video tokens; a learned [cls]{.smallcaps} token is then prepended. Each of the six blocks of the success probe applies three sequential sub-operations with pre-norm and residual connections: a) spatial self-attention within each frame independently over the $P$ patch tokens, b) temporal self-attention across the $T$ frames independently per patch position, and c) cross-attention from all video tokens to the frozen SigLIP 2 token sequence encoding $\ell$, followed by a SwiGLU FFN. After the final RMSNorm, the mean of the $T \times P$ patch token representations is passed through a linear head to produce a binary logit $\hat{y}$.

The probe is trained with binary cross-entropy on SOAR episodes, with the encoder $f_\phi$, adapter $\alpha_\psi$, and SigLIP 2 text encoder all frozen; only the parameters of $s_\phi$ are updated. Instruction-mismatch negatives (episodes paired with a language instruction drawn from a different task family) are mixed in to force the above cross-attention mechanism to genuinely ground success in the video content rather than ignoring $\ell$. Checkpoints are selected by balanced accuracy with ROC-AUC as the tie-breaker, accounting for SOAR's 1:2 success-to-failure class imbalance. At evaluation, the same frozen $s_\phi$ is applied without retraining to world model-generated latent trajectories $\hat{z}_{0:T}$ from DiT rollouts of the same SOAR episodes. The drop in balanced accuracy from **Enc. Acc** to **WM Acc** measures semantic drift, *i.e.*, the degree to which the transition model $p_\theta$ degrades task-outcome separability in latent space over the full rollout horizon, a signal invisible to per-step action metrics.

Evaluation metrics {#app:eval_metrics}
==================

Planning and downstream policy performance
------------------------------------------

We evaluate planning and policy performance through three complementary sub-protocols: CEM-based latent controllability, VLA-in-the-loop closed-loop success, and robustness under distribution shift. Throughout, $a_t \in \mathbb{R}^{d_a}$ is the action vector with seven degrees of freedom, $a^*_{t:t+k-1}$ is the ground-truth $k$-step action sequence, and $\tilde{z}_t$ is the compact latent on which the DiT $p_\theta$ operates.

#### A) CEM action controllability.

We evaluate whether a trained world model preserves action information by asking whether actions can be recovered from its latent dynamics. Given a held-out transition window with two real context latents, $(\tilde z_t,\tilde z_{t+1})$, ground-truth action sequence $a^*_{t+1:t+k}$, and target future latents $\tilde z^*_{t+2:t+k+1}$, we solve $$a^{\mathrm{plan}}_{t+1:t+k}
  =
  \argmin_{a_{t+1:t+k}}
  \frac{1}{k}
  \sum_{j=1}^{k}
  \left\|
    p_\theta^{(j)}(\tilde z_t,\tilde z_{t+1}, a_{t+1:t+k})
    - \tilde z^*_{t+1+j}
  \right\|_2^2 .
  \label{eq:cem_appendix}$$ Here $p_\theta^{(j)}$ denotes the $j$th autoregressive latent prediction from the world model. We report results for $k \in \{1,4\}$ using 100 held-out windows per model.

The optimization in Eq. [\[eq:cem\_appendix\]](#eq:cem_appendix){reference-type="eqref" reference="eq:cem_appendix"} uses the cross-entropy method (CEM) [@rubinstein2004cross]. For each transition window, CEM maintains a diagonal Gaussian over the optimized action coordinates for all $k$ steps. In the reported runs, we use a population of 400 candidate action sequences, 5 CEM iterations, and 50 elites per iteration, i.e. an elite fraction of $0.125$. The sampling distribution is initialized with mean $a^*_{t+1:t+k}$ on the searched coordinates and standard deviation equal to one quarter of the action range for each searched coordinate. After each iteration, the Gaussian mean and standard deviation are set to the empirical mean and standard deviation of the elite set.

Each CEM candidate is evaluated with one latent rollout sample. The diffusion sampler uses the same inference setting as evaluation, with 10 flow-matching Euler steps per predicted latent frame. To make the CEM objective deterministic for a given transition, we sample one Gaussian rollout-noise tensor per transition window and reuse it for all candidates and all CEM iterations. Thus the world-model rollout is stochastic across evaluation windows through the sampled diffusion noise, but the optimizer sees a fixed objective within each window. For $k>1$, candidates are evaluated by a joint autoregressive rollout: after the first predicted latent, the prediction is appended to the context and used to predict the next latent under the next candidate action.
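
A minimal sketch of this optimization loop is given below; `cost_fn` is a hypothetical wrapper around the world model rollout that returns the per-candidate latent MSE of Eq. [\[eq:cem\_appendix\]](#eq:cem_appendix){reference-type="eqref" reference="eq:cem_appendix"} under one fixed rollout-noise tensor per transition window.

```python
import torch

def cem_plan(cost_fn, a_init, act_range, pop=400, iters=5, n_elite=50):
    """CEM over action sequences; a_init: (k, d_a) ground-truth init."""
    mean = a_init.clone()                        # initialized at a*_{t+1:t+k}
    std = act_range / 4 * torch.ones_like(mean)  # quarter of the action range
    for _ in range(iters):
        cand = mean + std * torch.randn(pop, *mean.shape)   # (pop, k, d_a)
        costs = cost_fn(cand)                    # (pop,) latent rollout errors
        elite = cand[costs.topk(n_elite, largest=False).indices]
        mean, std = elite.mean(0), elite.std(0)  # refit the diagonal Gaussian
    return mean                                  # a^plan
```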

We compute the **CEM error** from the recovered action sequences: $\frac{1}{k}\sum_{j=1}^{k}
  \|a^{\mathrm{plan}}_{t+j,S}-a^*_{t+j,S}\|_2$, averaged over transitions, where $S$ is the set of searched action dimensions. Lower error indicates that the world-model latent dynamics are more action-sensitive under the CEM inversion test.

#### B) VLA-in-the-loop closed-loop success.

We roll out OpenVLA-7B [@kim2025openvla] inside each world model for 50-step rollouts across 20 Bridge V2 test episodes with $8$ independent trials per episode (*i.e.*, $N=80$ total rollouts). Each rollout video is scored by two VLMs, InternVL-3.5-14B [@wang2025internvl3] and Qwen-3.6-27B [@qwen3.6-27b], using 16 tail-biased frames sampled from the rollout. We use these to compute the following closed-loop success metrics:

-   Consensus success rate (**Consensus SR**) reports the fraction of trials scored as a success by *both* raters simultaneously: $\mathrm{CSR} = \frac{1}{N}\sum_{i}\mathbf{1}[\mathrm{score}_i^{\mathrm{InternVL}} \geq 0.5 \wedge \mathrm{score}_i^{\mathrm{QwenVL}} \geq 0.5]$. Requiring agreement from both raters reduces false positives from any single rater's miscalibration.

-   **Borda rank** is the sum of rank positions across both raters within each DiT-size group: $\mathrm{Borda} = r_{\mathrm{InternVL}} + r_{\mathrm{QwenVL}}$, where $r_{\mathrm{InternVL}}$ and $r_{\mathrm{QwenVL}}$ are the ordinal ranks of the model by SR-InternVL and SR-QwenVL respectively, with rank 1 being the best. This ordinal measure is robust to rater calibration drift, and a lower score is better; a minimal computation sketch for both metrics follows this list.
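
Both metrics reduce to a few lines; the sketch below assumes per-trial score lists and per-model SR arrays as inputs.

```python
import numpy as np

def consensus_sr(s_a, s_b, thr=0.5):
    """Fraction of trials both raters score as a success."""
    return np.mean((np.asarray(s_a) >= thr) & (np.asarray(s_b) >= thr))

def borda(sr_a, sr_b):
    """Per-model sum of ordinal ranks across the two raters (1 = best)."""
    rank = lambda v: np.argsort(np.argsort(-np.asarray(v))) + 1
    return rank(sr_a) + rank(sr_b)
```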

#### C) VLM interaction-quality rubric.

Each rollout is additionally scored by InternVL 3.5 [@wang2025internvl3] on a structured rubric with three independent sub-scores on a 1--5 integer scale, then averaged across the $N$ trials [@shang2026worldarena], using the prompt described in Sec. [11.5](#app:vlm-judge){reference-type="ref" reference="app:vlm-judge"}.

-   Interaction quality score (**IQ score**$\uparrow$) measures the plausibility of robot--object contact, including whether grasps, pushes, and force transfers look realistic and avoid interpenetration artifacts. This helps capture whether the world model renders credible manipulation dynamics without requiring pixel-level ground truth.

-   Instruction following (**Instr. follow**$\uparrow$) is the degree to which the rollout visually executes the language instruction $\ell$ (e.g., grasping the correct object, moving in the specified direction). Instruction following is complementary to binary SR in that it captures partial progress on episodes where neither judge counts the rollout as a full success.

#### D) Out-of-Distribution (OOD) robustness.

We re-run a subset of 10 of the 20 tasks used for calculating VLA SR, keeping the $8$-trial setup, under two independent perturbations. Distractor-object (OOD distractor) rollouts add OOD objects to the scene as described in Sec. [11.5](#app:vlm-judge){reference-type="ref" reference="app:vlm-judge"}, while OOD-instruction rollouts replace the language instruction $\ell$ with a semantically unrelated instruction drawn from a different Bridge V2 task family. Success rates under perturbation use the mean of the two per-rater SRs:

-   OOD SR Distractor: the per-rater mean SR under distractor objects.

-   OOD SR Instruction: the per-rater mean SR under the substituted instruction.

Pixel fidelity and scene geometry {#app:pixel-metrics}
---------------------------------

Action faithfulness is a necessary but not sufficient condition for world modeling, *e.g.*, a model that steers correctly yet generates physically implausible scenes will still mislead a policy that relies on visual observations. We thus evaluate decoded rollout quality across three categories --- visual quality, content consistency, and motion quality --- each containing *reference-based* metrics that compare generated frames $\hat{o}_t$ to paired ground-truth frames $o_t^*$, and *reference-free* perceptual metrics [@shang2026worldarena] that score generated clips without a ground-truth counterpart. All metrics are computed over 1,000 test episodes.

#### A) Visual quality.

Reference-based metrics include:

-   **PSNR**$\uparrow$ measures the peak signal-to-noise ratio $10\log_{10}(1/\mathrm{MSE}(\hat{o}_t, o_t^*))$, averaged over frames and episodes. This helps quantify pixel-level reconstruction accuracy but not the perceptual structure.

-   **SSIM**$\uparrow$ measures structural similarity [@wang2004image] between $\hat{o}_t$ and $o_t^*$, computed on luminance with a local window. It captures structural and contrast coherence that PSNR misses.

-   **LPIPS**$\downarrow$ measures the learned Perceptual Image Patch Similarity [@zhang2018unreasonable] using AlexNet [@krizhevsky2012imagenet] features. LPIPS correlates better with human perceptual judgments than pixel-level metrics, penalizing blurry or structurally incorrect generations even when MSE is low.

-   **FID**$\downarrow$ quantifies the Fréchet Inception Distance [@heusel2017gans] between the distribution of generated and ground-truth frames, computed from InceptionV3 2048-D features [@szegedy2016rethinking]. FID measures the population-level gap between generated and real frame distributions, capturing systematic biases that per-frame metrics average away.

Reference-free metrics are borrowed from @shang2026worldarena and include:

-   **Image quality**$\uparrow$ measures the MUSIQ [@ke2021musiq] multi-scale image quality score, normalized to $[0,1]$. This helps quantify the perceptual quality of individual frames using a model trained on human quality ratings, without requiring a ground-truth reference.

-   **Aesthetic quality**$\uparrow$ uses the LAION aesthetic predictor score [@schuhmann2022laion], normalized to $[0,1]$ from a raw $[0,10]$ scale. This helps capture the compositional and stylistic appeals of generated frames independently of content accuracy.

-   **JEPA similarity**$\uparrow$ scores the similarity between generated and real clips using the maximum mean discrepancy (MMD) between feature distributions extracted from JEPA [@assran2023self], providing evaluation results that better align with human perception.

#### B) Motion quality.

Reference-based metrics include:

-   **FVD**$\downarrow$ measures the Fréchet Video Distance [@unterthiner2018towards] computed from ResNet-3D features on 16-frame clips. FVD helps extend FID to the temporal domain, thus capturing spatiotemporal distribution quality of full video clips rather than individual frames.

-   **t-LPIPS**$\downarrow$ uses RAFT [@teed2020raft] to estimate the optical flow $\mathbf{u}_{t-1 \to t}$ on the ground-truth frames. Both generated and ground truth (GT) frames are then warped with this shared flow. t-LPIPS is the mean absolute difference between the per-step LPIPS of the flow-warped generated video and the flow-warped GT video. Using GT flow as a shared reference decouples temporal dynamics quality from content. A low score signifies that the model's frame-to-frame motion pattern matches the ground truth.

-   **PCK coverage**$\uparrow$ uses CoTracker [@karaev23cotracker] to track a $16{\times}16$ grid of query points placed on the first context frame through the generated video. PCK coverage is the mean fraction of these query points that remain visible (tracked with high confidence) at each rollout step. A drop across steps indicates that the generated video causes points to leave the frame or become untrackable, which implies geometric instability.

Reference-free metrics are borrowed from @shang2026worldarena and include:

-   **Dynamic degree**$\uparrow$ measures the fraction of inter-frame pairs in a generated clip where RAFT-estimated optical flow magnitude exceeds a threshold $\tau{=}6$ pixels. A near-zero value indicates a nearly static rollout, which is unlikely to be action-faithful regardless of pixel quality.

-   **Flow score**$\uparrow$ quantifies the mean magnitude of the top-5% of optical flow vectors across all inter-frame pairs in a generated clip. This helps capture the strength of dominant motion events, complementing dynamic degree which only measures their frequency.

#### C) Reconstruction ceiling.

For each encoder, all reference-based metrics are additionally computed on *reconstructed* frames, *i.e.*, real observations encoded and decoded without any DiT. This gives us a per-encoder upper bound. The gap $\Delta$ is the difference between the world model score and this ceiling, isolating the quality loss attributable to the transition model rather than the decoder. A large gap indicates that the DiT struggles to generate in-distribution latents while a small gap implies that the encoder--decoder path is not the bottleneck.

Latent representation quality
-----------------------------

#### A) Action Recoverability.

A world model can score well on PSNR/SSIM yet use an encoder that never encoded action information to begin with, or use a good encoder but a DiT that overlooks the action-discriminative geometry during denoising. Action recoverability metrics address these failure modes via the following reference-based measures:

-   **IDM Pearson $r$ (Encoder)** uses an Inverse Dynamics Model (IDM) head [@tian2025predictive] trained on consecutive frozen encoder latent pairs $(z_t,\,z_{t+k})$ from Bridge V2 to predict an action chunk $\hat{a}_{t:t+k-1} \in \mathbb{R}^{k \times 7}$ for horizon $k\in\{1,4\}$. Pearson $r$ is then computed by averaging over the six continuous action dimensions on held-out real-encoded frames, establishing the maximum step-level action information recoverable from each encoder space.

-   **IDM Pearson $r$ (WM)** applies the same frozen IDM (trained on real latents) to world model-generated latent pairs $(\hat{z}_t,\,\hat{z}_{t+k})$ from DiT rollouts of the same episodes. A small Real--WM difference in $r$ confirms that the transition model faithfully preserves action-relevant latent geometry during generation, while a large gap indicates generation-induced erasure of action-discriminative structure even when decoded pixels look faithful, a degradation invisible to pixel metrics.

#### B) Success classifier Accuracy or Success Separability.

We seek to measure whether the world model's generated latent trajectories retain enough task-outcome structure for a frozen success classifier to distinguish successful from failed episodes, *i.e.*, the DiT preserves semantic meaning over a full rollout and not just local action geometry. Semantic fidelity includes the following reference-based metrics that require ground-truth success/failure labels:

-   **Enc. Acc** signifies the encoder ceiling: a factored spatial--temporal attention probe $s_\phi$, conditioned on frozen SigLIP 2 text tokens, is trained on real encoder latent trajectories $z_{0:T}$ from SOAR [@zhou2024autonomous] to classify task success given the language instruction. Balanced accuracy on held-out real-encoded trajectories establishes the probe ceiling, *i.e.*, the maximum task-success information preserved in each encoder space.

-   **WM Acc.** applies the frozen probe $s_\phi$ without retraining to full world model-generated latent rollouts of the same episodes. A lower WM Acc relative to Enc. Acc reveals *semantic drift*: the generated trajectory has lost task-outcome separability even when per-step action signals remain partially intact.

VLA-based evaluations {#app:vla-eval-ood}
---------------------

We manually pick the set of 20 tasks in Table [\[tab:vla-per-instruction-ditl\]](#tab:vla-per-instruction-ditl){reference-type="ref" reference="tab:vla-per-instruction-ditl"} to cover a good mix of task difficulty and diversity from the Bridge V2 test set. The tasks involve instructions like pick-and-place, opening/closing, interacting with non-rigid objects like clothes, and tasks that require precise arm and gripper control. We use Claude Opus 4.7 to generate the OOD instructions given the original task instruction in Table [\[tab:ood-instruction-pairs\]](#tab:ood-instruction-pairs){reference-type="ref" reference="tab:ood-instruction-pairs"}. These OOD instructions span several types of variation.

VLM prompts {#app:vlm-judge}
-----------

We list the exact prompts used to create the out-of-distribution (OOD) distractor images, score the VLA policy trajectories, and score the interaction quality and related metrics. We provide a summary of the full prompt from @shang2026worldarena for the latter. We chose a subset of 10 tasks sampled equally from the difficulty levels in Table [\[tab:vla-per-instruction-ditl\]](#tab:vla-per-instruction-ditl){reference-type="ref" reference="tab:vla-per-instruction-ditl"} and use the ChatGPT Images 2.0 model with the distractor prompt below to generate the initial frame with OOD objects added to the scene.

**Distractor Image Editing** (text-guided distractor insertion for OOD distractor objects). Exact prompt template:

> This is an initial observation for a robotics task {task\_instruction}.\
> Modify this image by adding distraction objects in the scene in a natural way without moving or changing any objects in the original scene.\
> Requirements:\
> - The robotic arm should be visible.\
> - With the distractors, the task {task\_instruction} should remain achievable.

**Episode Success/Failure Scoring** (online rollout scoring used by policy-in-the-loop evaluation). Prompt structure:

> Here is a sequence of frames from a robot policy which has been rolled out in a video-generation-based world model. I need your help determining whether the policy is successful. How successfully does the robot complete the following task?\
> Instruction: {instruction}\
> Score rubric:\
> 0 = Failure\
> 0.5 = Partial (optional, when partial criteria are provided)\
> 1 = Success\
> Provide brief reasoning (2--3 sentences). Then output exactly one final line: Final Score: X

The binary version uses only `0/1`. The partial-credit version adds `0.5` when a `partial_criteria` string is present.

**Interaction Quality / Perspectivity / Instruction Following** (multi-dimensional VLM judge rubric for the interaction-quality metrics). Prompt summary:

-   Evaluates **three** dimensions on a 1--5 Likert scale: Interaction Quality, Perspectivity, and Instruction Following.

-   Scene prior is explicit: tabletop or counter-top *robotic arm* manipulation, not human-hand videos.

-   Includes a hard hallucination check: if the video shows human hands instead of robotic arms, Instruction Following should be scored at most 2.

-   Requires the model to base judgments only on visible evidence in the sampled frames and to consider temporal coherence.

-   Output is forced to a single JSON object with exactly three top-level keys:

    > {\"Interaction\_Quality\": {\"score\": 1--5, \"reason\": \"\...\"},\
    > \"Perspectivity\": {\"score\": 1--5, \"reason\": \"\...\"},\
    > \"Instruction\_Following\": {\"score\": 1--5, \"reason\": \"\...\"}}

We rated all trajectories using three strong open-source Vision Language Models (VLMs), InternVL3.5-14B [@wang2025internvl3], Qwen3.6-27B [@qwen3.6-27b], and Qwen3.5-9B [@qwen3.5], with the same scoring prompt and sampled frames. We sampled 16 frames from each episode, with 10 sampled uniformly throughout the video and 6 sampled uniformly from the second half of the episode, since the ending of a trajectory often carries more task-success-relevant information. We then calculate Cohen's kappa $\kappa$ to measure the agreement between each pair of VLM raters (Fig. [5](#fig:vlm-kappa){reference-type="ref" reference="fig:vlm-kappa"}), and find that InternVL3.5-14B and Qwen3.6-27B are in moderate agreement. Thus we chose the consensus rating from these two VLMs for our success rate figures. We also verify that the main trend is supported by non-VLM metrics: CEM, IDM, success probes, and visual/geometric metrics.
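
The pairwise agreement computation is a standard call; the sketch below uses placeholder score lists rather than real ratings.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder per-trial scores for two raters (illustrative values only).
scores_internvl = [1.0, 0.0, 0.5, 1.0]
scores_qwen     = [1.0, 0.0, 1.0, 1.0]

kappa = cohen_kappa_score(
    [s >= 0.5 for s in scores_internvl],  # binarize at the success threshold
    [s >= 0.5 for s in scores_qwen],
)
```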

![**The Cohen's kappa** for inter-VLM rater agreement. Given the higher agreement between InternVL 3.5 and Qwen 3.6, we choose these as our VLM judges for policy-in-the-loop task success experiments.](figures/results/E_fig2_rater_kappa_main_warm_cool.png){#fig:vlm-kappa width="50%"}

Additional Results {#app:additional-res}
==================

Visual performance across DiT backbone sizes
--------------------------------------------

![**SSIM gap, LPIPS gap, and PCK coverage** over 45 rollout steps. While all encoders show a strictly increasing SSIM/LPIPS gap over the full rollout due to compounding errors (each autoregressive step feeds back slightly corrupted predictions as context), semantic latent spaces from SigLIP 2, V-JEPA 2.1, and Web-DINO remain particularly competitive when forced to extrapolate beyond the 10-frame horizon seen during training. Likewise, PCK coverage remains the highest for semantic encoders.](figures/results/G_fig1c_noadpt_trajectory_full_main_warm_cool.png){#fig:rollout-gap-full width="\\textwidth"}

Policy performance across DiT backbone sizes {#app:policy-perf}
--------------------------------------------

Statistical Analyses {#app:stat-analysis}
--------------------

[\[tab:dits-bootstrap-policy-cem\]]{#tab:dits-bootstrap-policy-cem label="tab:dits-bootstrap-policy-cem"}

#### Uncertainty over policy-facing metrics.

The results show the same simple pattern across the policy-facing metrics: semantic latent spaces are better for task-relevant behavior than reconstruction latent spaces. For in-distribution VLA rollouts, semantic encoders exceed reconstruction encoders by 9.8 percentage points, with a 95% paired bootstrap interval of \[2.5, 17.7\] points and an exact one-sided sign-flip test of $p=0.0129$ over the 20 shared task episodes. The OOD result is also positive: when pooling distractor and instruction shifts, semantic encoders exceed reconstruction encoders by 13.6 percentage points, with a 95% bootstrap interval of \[8.8, 18.4\] points and $p<5{\times}10^{-5}$. For CEM action recovery, lower error is better; semantic encoders reduce one-step controllability error by 0.0266, with a 95% bootstrap interval of \[0.0122, 0.0412\] and $p=0.00015$. Thus, the semantic-family advantage is statistically supported for VLA success, OOD success, and CEM action recovery.

Latent representation quality {#app:latent-rep}
-----------------------------

[\[tab:idm-pearson-k1-k4\]]{#tab:idm-pearson-k1-k4 label="tab:idm-pearson-k1-k4"}

[\[tab:success-probe-full\]]{#tab:success-probe-full label="tab:success-probe-full"}

![**Action trajectories induced by encoder spaces:** episode rollouts projected onto the top-2 canonical-correlation directions between IDM features and ground-truth actions. $(\rho_1, \rho_2)$ are the leading canonical correlations, $\eta$ summarizes the aggregate action alignment. Colored curves are episodes.](figures/results/action_subspace_idm_warm_cool.png){#fig:action-traj-full width="85%"}

Multi-view transfer learning
----------------------------

Effect of adapter dimension
---------------------------

[\[tab:adapter-dim-ablation\]]{#tab:adapter-dim-ablation label="tab:adapter-dim-ablation"}

Table [\[tab:adapter-dim-ablation\]](#tab:adapter-dim-ablation){reference-type="ref" reference="tab:adapter-dim-ablation"} shows that the adapter bottleneck dimension has a non-monotonic effect on performance, with a sweet spot at intermediate widths. For Web-DINO with DiT-S, the intermediate $d_{96}$ setting gives the best overall tradeoff, achieving the highest VLA success rate and the best LPIPS, FID, and FVD. Smaller bottlenecks such as $d_{16}$ remain competitive for policy performance but lose visual quality, while using the full $D_{1024}$ encoder output is worse than the compact $d_{96}$ adapter.

Additional Rollouts {#app:rollouts}
===================

We provide additional rollouts alongside the key observations for Open-VLA success rate comparison (Fig. [8](#fig:success-rate-comp-2){reference-type="ref" reference="fig:success-rate-comp-2"}), plain pixel rollouts for comparing differences between standard model outputs (Fig. [9](#fig:rollout-pixels2){reference-type="ref" reference="fig:rollout-pixels2"}) and hallucinated model outputs (Fig. [10](#fig:hallucinated-rollout-pixels2){reference-type="ref" reference="fig:hallucinated-rollout-pixels2"}), rollouts under OOD distractor objects as well as under OOD instructions for all models across diverse episodes (Fig. [11](#fig:ood-comparison){reference-type="ref" reference="fig:ood-comparison"}, [13](#fig:ood-instruction){reference-type="ref" reference="fig:ood-instruction"}) as well as on the same episode (Fig. [12](#fig:ood-comparison-same-ep){reference-type="ref" reference="fig:ood-comparison-same-ep"}, [14](#fig:ood-instruction-same-ep){reference-type="ref" reference="fig:ood-instruction-same-ep"}). We also provide sample rollout videos for analyses with the supplementary files.

![**Open-VLA success rate comparison on two random episodes:** four frames are sampled at even intervals. []{style="color: green"} and []{style="color: red"} show trajectories marked as success and failure by InternVL 3.5 VLM.](figures/results/comparison_vla_2.png){#fig:success-rate-comp-2 width="\\textwidth"}

![**Pixel rollout comparison across models on diverse episodes:** the first frame is fed as context and the remaining 3 frames are sampled at even intervals from the generated world model rollout.](figures/results/comparison_combined.png){#fig:rollout-pixels2 width="\\textwidth"}

![**Hallucinated pixel rollout comparison across models on diverse episodes:** the first frame is fed as context and the remaining 3 frames are sampled at even intervals from the generated world model rollout. **Top:** flipping the pot consistently causes distortions for all models; **Middle:** turning the book pages causes the models to only partially follow the motion, with the book/page appearance becoming smeared and inconsistent and the page edges and cover boundaries drifting; **Bottom:** while all models predict the appearance of an opening drawer, some clearly under-predict the opening (e.g. V-JEPA 2.1) while others show an unstable drawer boundary and front panel (e.g. Cosmos). ](figures/results/comparison_combined_2.png){#fig:hallucinated-rollout-pixels2 width="\\textwidth"}

![**OOD Distractor comparison showing failure episodes per model:** OOD objects break task-object binding and action-conditioned state tracking across all models. []{style="color: green"} and []{style="color: red"} show trajectories marked as success and failure by InternVL 3.5. In their respective trajectories: Cosmos generates a less stable towel/object state; VAE fails at task-relevant placement of the silver pot; V-JEPA 2.1 loses stable binding between the can, the blue fork, and the instruction, with the can failing to end up reliably behind the fork; Web-DINO fails to maintain the pile-forming interaction; SigLIP 2 keeps the stove layout recognizable, but does not preserve the precise relation between the scrubber and the target burner. ](figures/results/comparison_ood.png){#fig:ood-comparison width="\\textwidth"}

![**OOD Distractor comparison for the same episode:** the OOD distractor competes with the target objects and exposes whether a model can keep the target objects bound to the instruction. []{style="color: green"} and []{style="color: red"} show trajectories marked as success and failure by InternVL 3.5. Here, irrespective of the task success, the added object visually changes the predicted interaction for all models: the robot/can motion becomes less task-directed, the can's position is less consistently moved behind the blue fork, and the models appear to let the distractor alter the scene dynamics. ](figures/results/comparison_ood_2.png){#fig:ood-comparison-same-ep width="\\textwidth"}

![**OOD Instruction comparison showing failure cases per model:** for each model, the same initial context is rolled out with the original instruction, which succeeds, and then with an OOD instruction, which fails. []{style="color: green"} and []{style="color: red"} show trajectories marked as success and failure by InternVL 3.5. Model-specific OOD instruction trajectories show: Cosmos rollout still moves the blue scrubber around the stove, but does not reliably bind it to the new target burner; VAE preserves the table scene, but fails to understand the spatial relation \"on top of\"; V-JEPA 2.1 rollout continues to look like sweeping/piling behavior rather than reversing the task into scattering; Web-DINO keeps the towel manipulation plausible, but misses the new container-based goal; SigLIP 2 rollout shows the lid disappearing off the frame. ](figures/results/comparison_ood_instruction.png){#fig:ood-instruction width="\\textwidth"}

![**OOD Instruction comparison for the same episode:** most models exhibit a common hallucination where they preserve the original object dynamics or default to a familiar action pattern instead of updating the final state to match the new instruction. []{style="color: green"} and []{style="color: red"} show trajectories marked as success and failure by InternVL 3.5. Both Cosmos and VAE maintain the cloth in a partially folded/creased state instead of flattening it. Semantic encoders more clearly capture the semantic difference between folding and unfolding, with V-JEPA 2.1 most clearly producing a flatter cloth for the OOD instruction. Web-DINO spreads the cloth, but with some shape distortion and robot occlusion, while for SigLIP 2, the cloth shape becomes rounded, suggesting some geometry hallucination despite the correct task-level outcome.](figures/results/comparison_ood_instruction_2.png){#fig:ood-instruction-same-ep width="\\textwidth"}
