Digital World Models

Summary

Digital world models are learned predictive models of software-defined environments. Their states are things like DOM trees, GUI screens, files, repository state, program variables, API responses, permissions, game state, logs, command output, or other structured digital observations. Their actions are clicks, keystrokes, API calls, shell commands, file edits, code execution, navigation steps, deployments, rollbacks, autoscaling commands, or other typed interventions.

In Agentic World Modeling, the digital-world regime is defined by program semantics: API contracts, UI state machines, file-system logic, network protocols, type constraints, permissions, error branches, popups, timeouts, and other mechanically checkable transition rules. This reliance on checkable rules is the main distinction from physical or social world models. Digital transitions are often deterministic in principle, but they branch heavily and the agent observes them only partially.

Core Distinction

A digital world model is not merely a model that draws screenshots or writes code. The useful target is:

current digital state + candidate action + digital constraints
  -> predicted next state / rollout / outcome

For an L2 digital simulator, the prediction must remain useful over multiple steps, respond correctly to interventions, and respect software constraints. A generated GUI state that looks plausible but violates an API contract, loses file-system state, ignores an error code, or invents a nonexistent button is not decision-usable.
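The point about plausible but unusable predictions can be made concrete with a minimal sketch. It assumes a toy file-system world; `State`, `Action`, `Prediction`, and `constraint_violations` are hypothetical names introduced here for illustration and do not come from the survey or any listed system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

State = Dict[str, str]  # hypothetical digital state: file path -> contents


@dataclass(frozen=True)
class Action:
    kind: str                      # "write" or "delete" (illustrative only)
    path: str
    content: Optional[str] = None


@dataclass(frozen=True)
class Prediction:
    next_state: State
    error: Optional[str] = None    # e.g. "ENOENT" when the target is missing


def constraint_violations(state: State, action: Action, pred: Prediction) -> List[str]:
    """Return the toy file-system rules that a predicted transition breaks."""
    problems: List[str] = []

    # Rule 1: paths the action did not touch must be neither lost nor invented.
    for path in sorted(set(state) | set(pred.next_state)):
        if path != action.path and state.get(path) != pred.next_state.get(path):
            problems.append(f"untouched path changed: {path}")

    # Rule 2: deleting a missing file must take the error branch.
    if action.kind == "delete" and action.path not in state:
        if pred.error != "ENOENT":
            problems.append("missing ENOENT on delete of absent file")
    # Rule 3: a successful write must leave exactly the written content.
    elif action.kind == "write":
        if pred.error is not None or pred.next_state.get(action.path) != action.content:
            problems.append("write not reflected in next state")
    # Rule 4: a successful delete must remove the path.
    elif action.kind == "delete":
        if pred.error is not None or action.path in pred.next_state:
            problems.append("delete not reflected in next state")

    return problems


# A prediction that looks plausible but silently drops b.txt.
state = {"a.txt": "1", "b.txt": "2"}
plausible_but_wrong = Prediction(next_state={"a.txt": "new"})
print(constraint_violations(state, Action("write", "a.txt", "new"), plausible_but_wrong))
```

The checks here need no access to the real environment; they only encode rules that any valid transition must satisfy, which is what makes them cheap to run on every predicted step.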

flowchart LR
  Agent["agent or planner"]
  Action["candidate action"]
  State["digital state"]
  WM["digital world model"]
  Rollout["predicted rollout"]
  Verifier["execution / type check / UI oracle / test"]
  Decision["rank, act, wait, or escalate"]

  Agent --> Action
  State --> WM
  Action --> WM
  WM --> Rollout --> Verifier --> Decision
  Decision --> Agent
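
The loop in the flowchart can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not an implementation from any listed system: `rollout`, `decide`, and the toy disk-slot world are hypothetical, candidate actions are given as short plans, and the verifier is reduced to a per-state predicate.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")
A = TypeVar("A")


def rollout(state: S, plan: Sequence[A], model: Callable[[S, A], S]) -> List[S]:
    """Apply the (hypothetical) world model step by step along a plan."""
    states = [state]
    for action in plan:
        states.append(model(states[-1], action))
    return states


def decide(
    state: S,
    plans: Sequence[Sequence[A]],
    model: Callable[[S, A], S],
    verify: Callable[[S], bool],
    score: Callable[[S], float],
) -> Tuple[str, Optional[Sequence[A]]]:
    """Rank verified rollouts; escalate when no candidate passes the verifier."""
    best_plan: Optional[Sequence[A]] = None
    best_score = float("-inf")
    for plan in plans:
        predicted = rollout(state, plan, model)
        if not all(verify(s) for s in predicted[1:]):
            continue  # rollout violates a checkable constraint
        final_score = score(predicted[-1])
        if final_score > best_score:
            best_plan, best_score = plan, final_score
    return ("act", best_plan) if best_plan is not None else ("escalate", None)


# Toy world (assumption): the state is the number of free disk slots.
def toy_model(free_slots: int, action: str) -> int:
    return free_slots + {"clean": 2, "write": -3}[action]


decision = decide(
    state=2,
    plans=[["write"], ["clean", "write"], ["clean"]],
    model=toy_model,
    verify=lambda s: s >= 0,          # free slots can never go negative
    score=lambda s: -abs(s - 1),      # prefer ending with exactly one free slot
)
print(decision)
```

Escalation is the fallback when every candidate rollout fails verification, which mirrors the "rank, act, wait, or escalate" node in the diagram.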

Representative Systems In Agentic World Modeling

The survey lists these as representative L2 digital-world anchors. They are not all separately ingested in this KB yet, so use this table as survey attribution rather than as a local independent verdict on every system.

| System | Paper title in the survey bibliography | What it is trying to model |
| --- | --- | --- |
| WebDreamer | Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents | Web state transitions for model-based web-agent planning. |
| GameCraft | Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition | Action-conditioned interactive game video rollouts. |
| MobileDreamer | MobileDreamer: Generative Sketch World Model for GUI Agent | Mobile GUI future state as task-related sketches. |
| Word2World | From Word to World: Can Large Language Models be Implicit Text-based World Models? | Text-only state/action environments as implicit world models. |
| Code2World | Code2World: A GUI World Model via Renderable Code Generation | GUI future state as executable/renderable code. |
| gWorld | Generative Visual Code Mobile World Models | Mobile GUI world modeling through generated visual code. |
| WebWorld | WebWorld: A Large-Scale World Model for Web Agent Training | Long-horizon web-agent simulation over open-web trajectories. |
| RWML | Reinforcement World Model Learning for LLM-based Agents | World-model learning coupled to RL-style improvement for LLM agents. |

The spelling in the survey table is Word2World and MobileDreamer.

Why This Matters For Alex’s Agenda

Digital world models are the closest conceptual neighbor to the local idea of digital-world robots: agents that can observe, remember, simulate, and act in digital or operational systems. The bridge is strong, but not automatic.

Scaling Test-Time Compute for Agentic Coding adds a narrower but useful agentic-coding lesson. It is not a world model, but it shows that long digital action-observation trajectories are too noisy to reuse in raw form. Structured summaries become the interface for selecting and refining future attempts. For digital world models, this supports a representation contract: the system needs a compact state of prior experience, but the compression must preserve the transition facts that matter for future actions.

CWM shows the code-domain version:

code / repo context + tool action
  -> program state, output, tests, or files

The target for observability and operations is different:

telemetry + service graph + event stream + operator action
  -> future telemetry, risk, reward, and recovery state

The analogy is useful because both need explicit action-observation trajectories. The boundary is equally important: web/GUI/code world models are usually structured around symbolic or executable software state, while production operations require numeric time series, graph time series, irregular event streams, hidden concurrent actors, delayed effects, failed actions, and human-approval semantics.
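
To show how the operations target differs from a symbolic software state, here is a minimal sketch of one action-observation record. Every name in it (`OpsTransition`, `OperatorAction`, `ActionStatus`, the field layout) is an assumption made for illustration, including the choice to model a WAIT step as a missing action.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple

Series = List[Tuple[float, float]]  # (timestamp, value) samples


class ActionStatus(Enum):
    EXECUTED = "executed"
    FAILED = "failed"
    PENDING_APPROVAL = "pending_approval"   # human-approval semantics


@dataclass
class OperatorAction:
    name: str               # e.g. "rollback" or "scale_out" (illustrative)
    target: str             # service the action is aimed at
    status: ActionStatus
    issued_at: float        # seconds since epoch


@dataclass
class OpsTransition:
    telemetry: Dict[str, Series]            # numeric time series per metric
    service_graph: List[Tuple[str, str]]    # (caller, callee) edges
    events: List[Tuple[float, str]]         # irregularly timed event stream
    action: Optional[OperatorAction]        # None models a WAIT step
    future_telemetry: Dict[str, Series] = field(default_factory=dict)

    def action_took_effect(self) -> bool:
        """Only executed actions can be credited with the observed future."""
        return self.action is not None and self.action.status is ActionStatus.EXECUTED


record = OpsTransition(
    telemetry={"p99_latency_ms": [(0.0, 180.0), (15.0, 420.0)]},
    service_graph=[("gateway", "checkout")],
    events=[(12.3, "deploy checkout v2")],
    action=OperatorAction("rollback", "checkout", ActionStatus.FAILED, issued_at=20.0),
)
print(record.action_took_effect())
```

Keeping failed and pending actions in the record, rather than dropping them, lets a model learn that an issued action does not always change the system.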

Evaluation Pattern

For digital world models, visual fidelity is secondary. The useful tests ask whether the model supports decision-making:

  • Does the predicted rollout preserve DOM, file-system, repository, GUI, or game state over many steps?
  • Does a changed action produce a directionally correct changed future?
  • Does the model handle error branches: permission denial, timeout, missing API, failed command, bad input, stale state, or race?
  • Can predictions be checked by execution, tests, type checks, UI oracles, or replay?
  • Does planning through the model improve real-environment success without exploiting simulator errors?
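
Two of the checks above, multi-step state preservation and error-branch handling, can be sketched against a toy environment. This is a hedged illustration: `real_step`, `matched_horizon`, `error_branch_accuracy`, and the counter world with one read-only key are all hypothetical and stand in for a real executable oracle.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

State = Dict[str, int]
Action = Tuple[str, str]                       # (verb, key), e.g. ("inc", "x")
Step = Callable[[State, Action], Tuple[State, Optional[str]]]


def real_step(state: State, action: Action) -> Tuple[State, Optional[str]]:
    """Toy ground-truth environment: counters with one read-only key."""
    verb, key = action
    if key == "locked":
        return dict(state), "EPERM"            # permission-denied branch
    if verb != "inc":
        return dict(state), "EINVAL"           # bad-input branch
    nxt = dict(state)
    nxt[key] = nxt.get(key, 0) + 1
    return nxt, None


def matched_horizon(model: Step, state: State, actions: Sequence[Action]) -> int:
    """Count leading steps where the open-loop model rollout matches execution."""
    real, pred = dict(state), dict(state)
    for i, action in enumerate(actions):
        real, real_err = real_step(real, action)
        pred, pred_err = model(pred, action)   # model consumes its own predictions
        if (real, real_err) != (pred, pred_err):
            return i
    return len(actions)


def error_branch_accuracy(model: Step, cases: Sequence[Tuple[State, Action]]) -> float:
    """Fraction of cases where the model predicts the same error code as execution."""
    hits = sum(model(s, a)[1] == real_step(s, a)[1] for s, a in cases)
    return hits / len(cases)


def optimistic_model(state: State, action: Action) -> Tuple[State, Optional[str]]:
    """A deliberately flawed model that never predicts an error."""
    nxt = dict(state)
    nxt[action[1]] = nxt.get(action[1], 0) + 1
    return nxt, None


plan = [("inc", "x"), ("inc", "locked"), ("inc", "x")]
print(matched_horizon(optimistic_model, {}, plan))
print(error_branch_accuracy(optimistic_model, [({}, a) for a in [("inc", "x"), ("inc", "locked"), ("dec", "x")]]))
```

The rollout is open-loop on purpose: the model is fed its own predictions, so a single missed error branch shortens the matched horizon for every later step.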

The L3 boundary should be stricter than ordinary replanning. For operations, evidence-driven model revision should mean governed updates from replayable evidence, with regression tests, canary checks, rollback paths, and audit trails. An agent that changes its plan inside a fixed simulator is still using the model; it is not necessarily revising the model’s governing laws.
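
One way to read "governed updates" is as a promotion gate around the model. The sketch below is an assumption-laden illustration: `GovernedModel`, `replay_accuracy`, the integer toy world, and the specific acceptance thresholds are all hypothetical and are not taken from the survey.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Model = Callable[[int, str], int]     # hypothetical (state, action) -> next state
Evidence = Tuple[int, str, int]       # replayable (state, action, observed next state)


def replay_accuracy(model: Model, evidence: Sequence[Evidence]) -> float:
    """Share of recorded transitions that the model reproduces exactly."""
    return sum(model(s, a) == nxt for s, a, nxt in evidence) / len(evidence)


@dataclass
class GovernedModel:
    active: Model
    history: List[Model] = field(default_factory=list)     # rollback path
    audit_log: List[str] = field(default_factory=list)     # audit trail

    def propose(self, candidate: Model, regression: Sequence[Evidence],
                canary: Sequence[Evidence]) -> bool:
        """Promote a revised model only if it passes both evidence gates."""
        # Gate 1 (assumed threshold): no loss of accuracy on old evidence.
        no_regression = (replay_accuracy(candidate, regression)
                         >= replay_accuracy(self.active, regression))
        # Gate 2 (assumed threshold): strict improvement on the new evidence.
        canary_ok = (replay_accuracy(candidate, canary)
                     > replay_accuracy(self.active, canary))
        accepted = no_regression and canary_ok
        self.audit_log.append(f"propose accepted={accepted}")
        if accepted:
            self.history.append(self.active)
            self.active = candidate
        return accepted

    def rollback(self) -> None:
        """Restore the previously active model, if one exists."""
        if self.history:
            self.active = self.history.pop()
            self.audit_log.append("rollback")


def old_model(state: int, action: str) -> int:
    return state + 1                  # believes every action increments


def revised_model(state: int, action: str) -> int:
    return state * 2 if action == "double" else state + 1


governed = GovernedModel(active=old_model)
accepted = governed.propose(
    revised_model,
    regression=[(1, "inc", 2)],       # previously explained evidence
    canary=[(3, "double", 6)],        # new evidence the old model gets wrong
)
print(accepted, governed.audit_log)
```

Replanning never calls `propose`; it only queries `active`. That separation is the distinction the paragraph draws between using a model and revising its governing laws.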

Relation To World Models

Digital world models are one branch of World Models. The classic World Models source used visual game trajectories and action-conditioned latent rollout; modern digital world models move the same idea into software environments where much of the state is structured and verifiable.

Genie is a useful visual game-world boundary case: it makes generated environments controllable through learned latent actions, but its state is image/video rather than DOM, API, file-system, or executable program state. It should transfer as an action/interface pattern, not as proof that pixel-level video rollout is enough for software or operations control.

For this wiki, the most valuable digital-world models are the ones that make the action/state contract explicit enough to transfer toward action-conditioned time-series systems. A screenshot-only GUI simulator is less useful than a simulator that preserves the underlying state machine, action log, and outcome semantics.

Open Questions

  • Can digital world models combine learned rollouts with hard symbolic checks, such as type checking, execution, tests, and API contract validation?
  • Which output representation is best for GUI and web worlds: pixels, sketches, DOM diffs, executable code, browser state, or hybrid latents?
  • When should prior digital-agent experience be stored as textual summaries, learned latent state, executable tests, workspace artifacts, or reusable tools?
  • How should a digital world model represent hidden concurrent users and delayed external services?
  • Can web/GUI world-model training teach anything durable about SRE telemetry control, or is the transfer mostly architectural?
  • What benchmark would show that an operations world model improves action ranking, WAIT decisions, and safe human escalation?