On Training in Imagination

Source

Status And Credibility

This is an arXiv preprint dated 2026-05-07 and revised on 2026-05-11, by Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, and David Harel, with affiliations listed as Weizmann Institute of Science, New York University, Columbia University, and AMI Labs. It is worth tracking as a theory and evaluation source because it directly addresses learned world-model training, reward-model error, sample allocation, and reward-label noise, and it comes from a strong research team. It is not yet peer reviewed, so treat it as theoretical and controlled-experiment evidence, not as proof that a large-scale RLHF or robotics pipeline should immediately change its collection policy.

No official project page, public code repository, lab/author blog, or author X/Twitter announcement was found during ingest. The paper text mentions released code for rollout and subsampling details, but no public repository URL was present in the checked arXiv, paper, review, or GitHub search artifacts. X_BEARER_TOKEN was unavailable locally, so no authenticated X API capture was possible.

Core Claim

Training policies inside a learned world model should not treat dynamics error and reward error as one undifferentiated model-error term. Dynamics transitions and reward annotations have different costs and scaling behavior, so the optimal data-collection strategy can be asymmetric: buy many environment transitions for the dynamics model, and allocate reward annotations according to a separate reward-error and reward-noise tradeoff.
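
To make the separation concrete, here is a generic simulation-lemma-style derivation. It is a sketch under assumptions that are not taken from the paper: the policy is folded into a deterministic closed-loop map f, the learned maps are \hat f and \hat r, their sup-norm errors are \varepsilon_f and \varepsilon_r, \hat f is L_f-Lipschitz, \hat r is L_r-Lipschitz, \gamma L_f < 1, and both rollouts start from the same s_0. The paper's actual theorem may use different constants and conditions.

```latex
\begin{align*}
\delta_t &:= \lVert s_t - \hat s_t \rVert, \qquad \delta_0 = 0, \\
\delta_{t+1}
  &\le \lVert f(s_t) - \hat f(s_t) \rVert + \lVert \hat f(s_t) - \hat f(\hat s_t) \rVert
   \le \varepsilon_f + L_f\,\delta_t
  \;\Longrightarrow\;
  \delta_t \le \varepsilon_f \sum_{k=0}^{t-1} L_f^{k}, \\
\lvert r(s_t) - \hat r(\hat s_t) \rvert
  &\le \lvert r(s_t) - \hat r(s_t) \rvert + \lvert \hat r(s_t) - \hat r(\hat s_t) \rvert
   \le \varepsilon_r + L_r\,\delta_t, \\
\lvert J - \hat J \rvert
  &\le \sum_{t=0}^{\infty} \gamma^{t}\,\bigl(\varepsilon_r + L_r\,\delta_t\bigr)
   \le \underbrace{\frac{\varepsilon_r}{1-\gamma}}_{\text{reward term}}
     + \underbrace{\frac{\gamma\,L_r\,\varepsilon_f}{(1-\gamma)(1-\gamma L_f)}}_{\text{dynamics term}}.
\end{align*}
```

Under these assumptions the two error sources enter through different coefficients, which is why a single pooled model-error term hides the tradeoff.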

Key Contributions

  • Extends simulation-lemma-style analysis to learned reward models, producing separate dynamics-error and reward-error terms.
  • Identifies lower Lipschitz constants of the learned dynamics, reward, and policy maps as a representation desideratum for tighter return-error bounds.
  • Connects that Lipschitz perspective to temporal straightening in latent world-model rollouts.
  • Uses power-law error assumptions to derive a closed-form optimal ratio between dynamics-transition samples and reward samples.
  • Shows that additive zero-mean reward noise keeps the multi-trajectory REINFORCE estimator unbiased, adding variance that can be reduced with more rollouts.
  • Separates zero-mean reward noise from systematic reward bias: averaging more trajectories cannot remove biased reward gradients.
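
One common proxy for temporal straightening is the average turning angle between consecutive displacement vectors of a latent trajectory. The sketch below illustrates that proxy only; the function name `mean_curvature` and the toy trajectories are hypothetical, and the paper's own straightening measure is not reproduced here.

```python
import math

def mean_curvature(latents):
    """Average turning angle (radians) between consecutive displacement vectors.

    `latents` is a sequence of equal-length coordinate tuples z_0, z_1, ...
    A perfectly straight trajectory gives a value near 0.
    """
    angles = []
    for z0, z1, z2 in zip(latents, latents[1:], latents[2:]):
        v1 = [b - a for a, b in zip(z0, z1)]
        v2 = [b - a for a, b in zip(z1, z2)]
        n1 = math.sqrt(sum(x * x for x in v1))
        n2 = math.sqrt(sum(x * x for x in v2))
        if n1 == 0.0 or n2 == 0.0:
            continue  # skip stationary steps, where the angle is undefined
        cos = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
        # Clamp to [-1, 1] to guard against floating-point overshoot.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return sum(angles) / len(angles) if angles else 0.0

# Toy 2-D latent trajectories (illustrative only).
straight = [(float(t), 2.0 * t) for t in range(6)]
zigzag = [(float(t), float(t % 2)) for t in range(6)]

curv_straight = mean_curvature(straight)
curv_zigzag = mean_curvature(zigzag)
```

The straight line scores close to zero, while the zigzag turns by a right angle at every step. Whether lower curvature translates into lower Lipschitz constants for a given encoder is an empirical question this sketch does not answer.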

Method Notes

The paper’s working setting is a deterministic MDP with learned dynamics and learned rewards.

The wiki reading is:

dynamics transitions -> learned dynamics model
reward annotations   -> learned reward model
learned dynamics + learned reward -> imagined rollout return

The important separation is not merely accounting. If N_dyn and N_rew have different unit costs and different power-law error exponents, then one global “more data” rule is too blunt.
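
A minimal sketch of why the split matters, assuming an objective that is not taken from the paper: total error is modeled as A * N_dyn^(-alpha) + B * N_rew^(-beta), and the budget constraint is cost_dyn * N_dyn + cost_rew * N_rew = budget. All constants below are hypothetical. Under this assumed objective the optimum satisfies the stationarity condition A * alpha * N_dyn^(-(alpha+1)) / cost_dyn = B * beta * N_rew^(-(beta+1)) / cost_rew, which the code finds numerically; the paper's own closed-form rule may differ.

```python
def total_error(n_dyn, n_rew, A, alpha, B, beta):
    """Assumed additive power-law error model (illustrative, not the paper's)."""
    return A * n_dyn ** (-alpha) + B * n_rew ** (-beta)

def best_split(budget, cost_dyn, cost_rew, A, alpha, B, beta, iters=200):
    """Ternary search over the budget fraction spent on dynamics transitions.

    The error is convex along the budget line, so ternary search converges.
    Returns (n_dyn, n_rew) as real numbers; round them in practice.
    """
    def err(frac):
        n_dyn = frac * budget / cost_dyn
        n_rew = (1.0 - frac) * budget / cost_rew
        return total_error(n_dyn, n_rew, A, alpha, B, beta)

    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if err(m1) < err(m2):
            hi = m2
        else:
            lo = m1
    frac = 0.5 * (lo + hi)
    return frac * budget / cost_dyn, (1.0 - frac) * budget / cost_rew

# Hypothetical fitted constants and unit costs, for illustration only.
A, alpha = 1.0, 0.5      # dynamics error ~ A * N_dyn^-alpha
B, beta = 1.0, 0.25      # reward error   ~ B * N_rew^-beta
cost_dyn, cost_rew, budget = 1.0, 20.0, 1e5

n_dyn, n_rew = best_split(budget, cost_dyn, cost_rew, A, alpha, B, beta)
even_split_error = total_error(budget / 2 / cost_dyn, budget / 2 / cost_rew,
                               A, alpha, B, beta)
best_error = total_error(n_dyn, n_rew, A, alpha, B, beta)
```

With these made-up numbers the search spends well under half the budget on dynamics transitions, and `best_error` is lower than `even_split_error`. Changing either exponent or either unit cost moves the split, which a single pooled sample count cannot express.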

flowchart LR
  DynData["dynamics transitions"]
  RewData["reward annotations"]
  Dyn["learned dynamics f_hat"]
  Rew["learned reward r_hat"]
  Rollout["imagined rollout"]
  Policy["policy update"]
  Eval["return error"]

  DynData --> Dyn
  RewData --> Rew
  Dyn --> Rollout
  Rew --> Rollout
  Rollout --> Policy
  Dyn --> Eval
  Rew --> Eval
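
The pipeline in the diagram can be exercised on a toy scalar system. The sketch below rolls out a true and a learned closed-loop model from the same start state, measures the return error, and compares it with a generic simulation-lemma-style bound that has separate reward and dynamics terms. The system, the error sizes, and the bound's exact form are assumptions made for illustration, not values or formulas from the paper.

```python
def rollout_return(step, reward, s0, gamma, horizon):
    """Discounted return of a deterministic closed-loop rollout."""
    s, total = s0, 0.0
    for t in range(horizon):
        total += (gamma ** t) * reward(s)
        s = step(s)
    return total

# Hypothetical scalar system with the policy folded into the dynamics.
a, b, c = 0.9, 0.1, 1.0          # true dynamics s' = a*s + b, true reward c*s
eps_f, eps_r = 0.05, 0.02        # sup-norm dynamics and reward errors
gamma, horizon, s0 = 0.95, 500, 1.0

def true_step(s):
    return a * s + b

def model_step(s):               # learned dynamics f_hat (constant offset error)
    return a * s + b + eps_f

def true_reward(s):
    return c * s

def model_reward(s):             # learned reward r_hat (constant offset error)
    return c * s + eps_r

real = rollout_return(true_step, true_reward, s0, gamma, horizon)
imagined = rollout_return(model_step, model_reward, s0, gamma, horizon)
return_error = abs(real - imagined)

# Assumed bound: eps_r/(1-gamma) + gamma*L_r*eps_f/((1-gamma)*(1-gamma*L_f)),
# valid when gamma * L_f < 1.
L_f, L_r = abs(a), abs(c)        # Lipschitz constants of f_hat and r_hat
reward_term = eps_r / (1 - gamma)
dynamics_term = gamma * L_r * eps_f / ((1 - gamma) * (1 - gamma * L_f))
bound = reward_term + dynamics_term
```

In this linear toy case the measured error nearly attains the bound, because each inequality behind it holds with equality here. Even so, the dynamics term dominates the reward term, since a one-step dynamics error compounds along the rollout while a reward error does not.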

Evidence And Results

  • The decomposition is theoretical and assumes a Lipschitz setting where return-error terms can be bounded separately.
  • In controlled synthetic and LQG-style checks, the paper reports that the bound holds across the tested configurations but is typically loose.
  • The sample-allocation experiment fits power laws for dynamics and reward error, then compares the closed-form allocation rule against observed allocation behavior.
  • The reward-noise section shows a useful positive result only for zero-mean additive noise.
  • The biased-reward proposition is the main operational warning: systematic reward bias survives trajectory averaging and can steer optimization in the wrong direction.
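
The noise-versus-bias distinction can be seen in a two-action bandit, sketched below under assumptions chosen for illustration: a sigmoid policy, Gaussian label noise, and an action-dependent bias. None of these choices come from the paper. An action-independent offset would behave like a baseline and leave the expected score-function gradient unchanged, so the example uses a bias that depends on the action.

```python
import math
import random

def reinforce_gradient(theta, reward_fn, n_samples, rng):
    """Monte Carlo REINFORCE estimate of dJ/dtheta for a two-action bandit.

    Policy: P(a=1) = sigmoid(theta); score function d/dtheta log pi(a) = a - p.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    total = 0.0
    for _ in range(n_samples):
        action = 1 if rng.random() < p else 0
        total += reward_fn(action, rng) * (action - p)
    return total / n_samples

def true_reward(action, rng):
    return 1.0 if action == 1 else 0.0

def noisy_reward(action, rng):
    # Zero-mean additive label noise (hypothetical standard deviation 2.0).
    return true_reward(action, rng) + rng.gauss(0.0, 2.0)

def biased_reward(action, rng):
    # Systematic, action-dependent bias: action 1 is under-rated by 2.0.
    return true_reward(action, rng) + (-2.0 if action == 1 else 0.0)

theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))
exact_grad = p * (1.0 - p)       # d/dtheta of J = p * 1 + (1 - p) * 0

rng = random.Random(0)
grad_noisy = reinforce_gradient(theta, noisy_reward, 200_000, rng)
grad_biased = reinforce_gradient(theta, biased_reward, 200_000, rng)
```

With many samples, `grad_noisy` settles near `exact_grad`, while `grad_biased` settles near `-exact_grad`: more samples sharpen the estimate of the wrong gradient instead of correcting it.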

Limitations

  • The paper is theory-heavy and uses controlled synthetic/LQG-style experiments; it does not evaluate a production RLHF pipeline, a frontier robot policy, or an observability-control system.
  • The main bound assumes deterministic dynamics and contraction-style Lipschitz conditions; the paper leaves stochastic dynamics and harder regimes as future work.
  • The bound is empirically conservative in the reported checks, so the exact constants should not be treated as operationally calibrated for real systems.
  • The “cheap noisy labels can win” lesson depends on zero-mean noise. Biased labels are a different failure mode.
  • The paper text mentions released code, but no public code URL was found during ingest.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Causal structure, counterfactuals, and control | adjacent | Separates learned transition dynamics from learned reward/value signals inside imagined rollouts. | No numeric telemetry, operator interventions, stochastic delayed effects, or real control benchmark. |
| Data and scaling substrate | adjacent | Applies power-law error assumptions to split a fixed sample budget between dynamics transitions and reward annotations. | Synthetic/LQG evidence; no large cross-domain TSFM scaling law. |
| Benchmark and evaluation hygiene | warning | Shows why dynamics error, reward error, reward noise, and reward bias should be measured separately. | Need reproducible protocols for real RLHF, robotics, and digital-operations reward collection. |
| Representation quality | adjacent | Connects lower Lipschitz constants and temporal straightening to tighter return-error coefficients. | No direct evidence for high-dimensional multivariate time-series representations. |

Open Questions

  • What would the same dynamics/reward budget split look like for observability control, where labels can be SLO-based, human-reviewed, delayed, or partly synthetic?
  • Can reward annotation noise in RLHF or robotics be treated as zero-mean often enough for the “many cheap labels” result to be useful?
  • Which representation-learning methods actually reduce Lipschitz constants without increasing dynamics error?
  • How should a benchmark report dynamics error, reward error, reward bias, rollout value, and live-transfer success without collapsing them into one score?
  • Where is the public code referenced by the paper text, and does it reproduce the synthetic/LQG allocation experiments?