stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

Source

Status And Credibility

This is an arXiv preprint dated 2026-05-20 by Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, and Randall Balestriero. It is credible enough to track as an important infrastructure source for three reasons: it comes from the same broader JEPA/world-model research cluster as LeWorldModel and LeJEPA, it ships with an official GitHub repository and documentation, and it was announced by Lucas Maes in an authenticated X thread. It is not yet peer reviewed, so benchmark, throughput, and platform-adoption claims should stay tied to the paper and released code.

Core Claim

stable-worldmodel (swm) is an open-source platform for standardized and reproducible world-model research. It unifies data collection, training support, baseline implementations, planning solvers, and evaluation protocols so that new action-conditioned world models can be compared with fewer hidden implementation differences.

The key wiki interpretation is narrow: swm is not itself a new foundation model. It is an infrastructure and benchmark source for testing whether learned latent dynamics remain useful under common planning, data-loading, and distribution-shift protocols.

Platform Interface

The paper’s abstraction stack is deliberately small:

  • World wraps Gymnasium-compatible environments for data collection, policy execution, vectorized evaluation, rendering, and controllable factors of variation.
  • Policy maps observations or latent states to actions/control inputs, including random policies, expert policies, learned policies, and MPC policies.
  • Solver optimizes finite-horizon action sequences against a world model cost, with sampling-based and gradient-based planners.

```mermaid
flowchart LR
  Env["World: environment + factors of variation"]
  Data["trajectory data: observations + actions + metadata"]
  Model["user world model: encoder + predictor"]
  Solver["solver: CEM / MPPI / iCEM / GD / PGD"]
  Policy["MPC policy"]
  Eval["evaluation: success, latency, OOD robustness"]

  Env --> Data
  Data --> Model
  Model --> Solver
  Solver --> Policy
  Policy --> Env
  Env --> Eval
```
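
The World / Policy / Solver split above can be illustrated with a toy Python sketch. This is not the swm API: every name here (`PointWorld`, `GridSolver`, `MPCPolicy`, `evaluate`, `toy_model`, `ACTION_GRID`) is hypothetical, the environment is a 1-D point rather than a Gymnasium environment, and brute-force enumeration over a small action grid stands in for the platform's sampling-based and gradient-based planners so that the behavior is deterministic.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Protocol, Sequence, Tuple

# Assumed discretization of the control input (illustrative only).
ACTION_GRID = (-1.0, -0.5, 0.0, 0.5, 1.0)


class World(Protocol):
    """Hypothetical minimal environment interface: reset and step."""

    def reset(self) -> float: ...

    def step(self, action: float) -> float: ...


@dataclass
class PointWorld:
    """Toy world: a point on a line that moves by the applied action."""

    position: float = 0.0

    def reset(self) -> float:
        self.position = 0.0
        return self.position

    def step(self, action: float) -> float:
        self.position += action
        return self.position


def toy_model(state: float, action: float) -> float:
    """Stand-in world model; here it matches the true dynamics exactly."""
    return state + action


@dataclass
class GridSolver:
    """Enumerate all action sequences on the grid and keep the cheapest."""

    model: Callable[[float, float], float]
    horizon: int = 3

    def cost(self, state: float, actions: Sequence[float], goal: float) -> float:
        # Running cost: summed distance to the goal along the imagined rollout.
        total = 0.0
        for action in actions:
            state = self.model(state, action)
            total += abs(state - goal)
        return total

    def solve(self, state: float, goal: float) -> Tuple[float, ...]:
        candidates = product(ACTION_GRID, repeat=self.horizon)
        return min(candidates, key=lambda seq: self.cost(state, seq, goal))


@dataclass
class MPCPolicy:
    """Replan at every step and execute only the first planned action."""

    solver: GridSolver
    goal: float

    def act(self, observation: float) -> float:
        return self.solver.solve(observation, self.goal)[0]


def evaluate(world: World, policy: MPCPolicy, steps: int = 20, tol: float = 0.25) -> bool:
    """Run the closed loop and report whether the goal was reached."""
    obs = world.reset()
    for _ in range(steps):
        obs = world.step(policy.act(obs))
    return abs(obs - policy.goal) <= tol


if __name__ == "__main__":
    policy = MPCPolicy(GridSolver(toy_model), goal=3.0)
    print(evaluate(PointWorld(), policy))  # prints True
```

The point of the sketch is the separation of concerns: the solver only sees the model and a cost, the policy only sees the solver, and evaluation only sees the world and the policy, so any one piece can be swapped without touching the others.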

For this wiki’s terminology, the main object is an action-conditioned trajectory interface: observations are encoded into latent state, actions/control inputs condition the predicted next state, and solvers compare candidate action sequences through model-predictive control.
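
One way to write this interface down is sketched below. The notation is an assumption of this note, not taken from the paper: $f_\theta$ is the encoder, $g_\phi$ the action-conditioned predictor, $c$ a planning cost, $z_g$ an encoded goal, and $H$ the planning horizon.

```latex
\begin{aligned}
z_t &= f_\theta(o_t), \\
\hat{z}_{t+k+1} &= g_\phi(\hat{z}_{t+k},\, a_{t+k}), \qquad \hat{z}_t = z_t, \quad k = 0, \dots, H-1, \\
a^{\star}_{t:t+H-1} &= \arg\min_{a_{t:t+H-1}} \; \sum_{k=1}^{H} c(\hat{z}_{t+k},\, z_g).
\end{aligned}
```

Under model-predictive control only the first action $a^{\star}_t$ is executed before the plan is recomputed from the next observation.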

Evidence And Results

  • The paper frames the core reproducibility problem as fragmented world-model pipelines: every paper may ship its own training stack, planner, and evaluation protocol, making it hard to separate model improvements from implementation differences.
  • The data layer uses Lance as the primary format while supporting MP4, HDF5, and LeRobot conversion. In the Push-T benchmark table, Lance local loading is reported at 4,815 samples/s without caching, versus 1,416 for local HDF5 and 1,331 for local video; Lance over S3 is reported at 3,184 samples/s without caching, versus 9 for HDF5 over S3.
  • The platform includes planning solvers such as CEM, iCEM, MPPI, Predictive Sampling, Gradient Descent, PGD, GRASP, and a Lagrangian solver.
  • The implemented baselines include DINO-WM, PLDM, LeWorldModel, TD-MPC2, and goal-conditioned RL baselines.
  • The benchmark suite spans classic control, MuJoCo, Atari, robotics, Craftax/open-world environments, and OGBench-style tasks, with controllable visual, geometric, and physical factors of variation.
  • In the Push-T case study, the paper reports that in-distribution success does not predict distribution-shift robustness. Prediction-error distributions for successful and failed plans overlap heavily under stronger shifts, and simple visual factors of variation sharply reduce planning success across baselines.
  • The online TD-MPC2 checks on DeepMind Control Suite tasks are used as implementation validation: the paper reports rewards competitive with SAC, while the same TD-MPC2 implementation struggles in the offline Push-T setting due to out-of-distribution action drift.
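
To make the solver entry above concrete, here is a minimal sketch of the cross-entropy method (CEM) on a toy problem. It is not the swm implementation: `toy_latent_step` is a hypothetical 1-D integrator standing in for a learned predictor, and the hyperparameters (`population`, `num_elites`, `iterations`, `min_std`) are illustrative assumptions.

```python
import random
import statistics
from typing import List, Sequence


def clip(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def toy_latent_step(z: float, a: float) -> float:
    """Hypothetical stand-in for a learned predictor: a 1-D integrator."""
    return z + clip(a)


def plan_cost(z0: float, actions: Sequence[float], z_goal: float) -> float:
    """Roll the model forward and score the terminal latent against the goal."""
    z = z0
    for a in actions:
        z = toy_latent_step(z, a)
    return abs(z - z_goal)


def cem_plan(
    z0: float,
    z_goal: float,
    horizon: int = 5,
    population: int = 64,
    num_elites: int = 8,
    iterations: int = 10,
    min_std: float = 0.05,
    seed: int = 0,
) -> List[float]:
    """Return the mean action sequence after iterative elite refitting."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # 1. Sample candidate action sequences from a per-step Gaussian.
        samples = [
            [clip(rng.gauss(mean[t], std[t])) for t in range(horizon)]
            for _ in range(population)
        ]
        # 2. Rank candidates by model-predicted cost and keep the elites.
        samples.sort(key=lambda seq: plan_cost(z0, seq, z_goal))
        elites = samples[:num_elites]
        # 3. Refit the sampling distribution to the elites (with a std floor).
        mean = [statistics.mean([seq[t] for seq in elites]) for t in range(horizon)]
        std = [
            max(min_std, statistics.pstdev([seq[t] for seq in elites]))
            for t in range(horizon)
        ]
    return mean


if __name__ == "__main__":
    plan = cem_plan(z0=0.0, z_goal=2.0)
    print(plan, plan_cost(0.0, plan, 2.0))
```

The variants listed above change how candidates are proposed or weighted. MPPI, for example, replaces the hard elite cut with a soft exponential weighting of all samples.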

X Thread Notes

The authenticated X thread is useful because it gives the authors’ launch narrative and intended community use.

  • Lucas Maes frames the platform as a way to reduce the cost of entering JEPA and world-model research.
  • The thread emphasizes the same fragmentation problem as the paper: separate training stacks, planners, and evaluation pipelines make results hard to compare.
  • The thread names the baseline set as DINO-WM, PLDM, LeWM, TD-MPC2, and related methods.
  • The thread highlights customizable environment properties such as colors, shapes, textures, gravity, friction, and wind.
  • The thread emphasizes clean planning solvers and LanceDB/Hugging Face bucket streaming support.
  • In a reply, Maes summarizes the goal as building tools researchers can trust and lowering their cost of action.

These notes should be treated as official narrative context. The paper and code remain the evidence sources for technical claims.

Limitations

  • The source is a recent arXiv preprint, not a peer-reviewed venue paper.
  • The strongest experimental evidence is simulator-based and Push-T-centered; it does not yet show real-world robot transfer or operational time-series intervention modeling.
  • The framework standardizes evaluation, but it cannot by itself solve the brittleness it exposes: current world models still fail under mild visual shifts, out-of-distribution trajectories, and longer planning horizons.
  • Lance throughput and storage results are useful systems evidence, but they are author-reported and hardware/data-layout-dependent.
  • The platform is model-agnostic by design, so it does not close representation learning, action-conditioned dynamics, uncertainty, or causal-intervention slots by itself.
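
For reference, the relative speedups implied by the author-reported Push-T loading figures (samples/s, without caching) can be computed directly. The dictionary keys are labels chosen for this note; the numbers are the ones quoted under Evidence And Results, and the resulting ratios carry the same hardware and data-layout caveats.

```python
# Author-reported Push-T throughput in samples/s, without caching.
throughput = {
    "lance_local": 4815,
    "hdf5_local": 1416,
    "video_local": 1331,
    "lance_s3": 3184,
    "hdf5_s3": 9,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of two reported throughputs, rounded to one decimal place."""
    return round(throughput[fast] / throughput[slow], 1)


local_vs_hdf5 = speedup("lance_local", "hdf5_local")    # 3.4
local_vs_video = speedup("lance_local", "video_local")  # 3.6
s3_vs_hdf5 = speedup("lance_s3", "hdf5_s3")             # 353.8

if __name__ == "__main__":
    print(local_vs_hdf5, local_vs_video, s3_vs_hdf5)
```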

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control and counterfactuals | adjacent | Standardizes action-conditioned planning evaluation with MPC solvers over learned latent dynamics. | Evidence is visual/control simulation, not numeric operational time series with typed interventions. |
| Benchmarks and evaluation protocol | partially closes outside time series | Adds standardized goal-conditioned evaluation, factors of variation, latency, success, and OOD robustness checks for world models. | Need comparable benchmarks for multivariate telemetry, graph time series, event streams, and real operator actions. |
| Data and system scaling | adjacent | Lance-based data layer targets contiguous multimodal trajectories and reports large throughput gains over HDF5/video in the tested settings. | Need large-scale numeric/time-series corpora, streaming updates, and reproducible cross-domain scaling studies. |
| Latent-state prediction | adjacent | Uses the encoder/predictor world-model interface and includes LeWorldModel, PLDM, and DINO-WM baselines. | It is an evaluation platform rather than a new latent-state objective. |

Open Questions

  • Can swm become the default reproduction surface for JEPA world-model papers, or will authors still diverge through private data, custom wrappers, and unpublished solver settings?
  • Which factors of variation best predict real-world transfer: visual shifts, geometry, physics, action latency, partial observability, or data-collection policy?
  • Can the same standardized-planning interface be ported to non-vision action-conditioned time series such as observability, healthcare interventions, industrial control, or recommender systems?
  • How should evaluation combine prediction error, planning success, latency, uncertainty, and distribution-shift robustness without collapsing them into one misleading score?
  • Can a Lance-like contiguous trajectory store become a shared data substrate for high-dimensional multivariate time-series world models?