RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Source

Core Claim

RDT-1B scales diffusion-action modeling to a 1.2B-parameter bimanual manipulation policy. It uses a Diffusion Transformer to denoise continuous action chunks conditioned on language, visual observations, proprioception, and control-frequency metadata.
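To make the denoising idea concrete, here is a toy 1-D sketch, not RDT's implementation. It assumes a single scalar "action" whose demonstrations are split evenly between -1 and +1, a standard DDPM-style variance schedule, and a closed-form ideal denoiser standing in for a learned, conditioned Diffusion Transformer. All function names and schedule values are illustrative assumptions. The MSE-optimal constant prediction collapses to the average of the two modes, while ancestral sampling with the denoiser ends on one mode or the other.

```python
import numpy as np


def make_schedule(num_steps=50, beta_min=1e-4, beta_max=0.2):
    """Linear DDPM-style variance schedule (values are assumptions)."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def ideal_denoiser(x_k, alpha_bar):
    """Posterior mean E[x0 | x_k] when x0 is -1 or +1 with equal probability.

    This closed form replaces the learned network for the toy problem only.
    """
    return np.tanh(np.sqrt(alpha_bar) * x_k / (1.0 - alpha_bar))


def sample_actions(num_samples, rng, num_steps=50):
    """Ancestral sampling: start from Gaussian noise and denoise step by step."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(num_samples)
    for k in range(num_steps - 1, -1, -1):
        x0_hat = ideal_denoiser(x, alpha_bars[k])
        if k == 0:
            # Last step: return the denoised estimate itself.
            x = x0_hat
            break
        ab_k, ab_prev = alpha_bars[k], alpha_bars[k - 1]
        # Gaussian posterior q(x_{k-1} | x_k, x0_hat) of the forward process.
        mean = (np.sqrt(ab_prev) * betas[k] / (1.0 - ab_k)) * x0_hat \
             + (np.sqrt(alphas[k]) * (1.0 - ab_prev) / (1.0 - ab_k)) * x
        var = betas[k] * (1.0 - ab_prev) / (1.0 - ab_k)
        x = mean + np.sqrt(var) * rng.standard_normal(num_samples)
    return x


# Balanced bimodal "demonstrations": half the actions are -1, half are +1.
targets = np.array([-1.0, 1.0] * 1000)

# The constant that minimizes mean squared error is the sample mean,
# which sits between the modes and matches neither demonstration.
regression_prediction = targets.mean()

samples = sample_actions(2000, np.random.default_rng(0))
print("regression prediction:", regression_prediction)
print("mean |sample|:", np.mean(np.abs(samples)))
print("fraction of samples at the + mode:", np.mean(samples > 0))
```

In a real policy the denoiser would be a network that also receives the conditioning inputs, and the scalar would be a full action chunk; the toy only isolates the mode-averaging effect.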

Method Notes

  • RDT explicitly targets multimodal continuous bimanual action distributions, where deterministic regression can average incompatible action modes.
  • The model treats proprioception, noisy action chunks, and control frequency as low-dimensional physical quantities, while images and text condition the denoising process.
  • Its unified physical action space is an interface decision for cross-robot training, not a claim that all embodiments have identical dynamics.
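The unified action space in the last note can be pictured as a fixed-length vector in which each index has one physical meaning, with robots filling only the slots they actually have. The sketch below is an assumption-laden illustration: the 16-dimensional size, the slot names, the index layout, and the availability mask are all hypothetical and are not taken from the source.

```python
import numpy as np

UNIFIED_DIM = 16  # hypothetical size, kept small for readability

# Hypothetical layout: every named physical quantity owns fixed indices,
# so the same index always means the same thing across robots.
SLOTS = {
    "right_arm_joint_pos": slice(0, 6),
    "right_gripper_open": slice(6, 7),
    "left_arm_joint_pos": slice(7, 13),
    "left_gripper_open": slice(13, 14),
    "base_linear_vel": slice(14, 15),
    "base_angular_vel": slice(15, 16),
}


def to_unified(robot_action):
    """Scatter a robot-specific action dict into the shared vector.

    Returns the padded vector and a 0/1 mask marking which entries are real,
    so a padded zero can be told apart from a commanded zero.
    """
    vec = np.zeros(UNIFIED_DIM)
    mask = np.zeros(UNIFIED_DIM)
    for name, values in robot_action.items():
        sl = SLOTS[name]
        values = np.asarray(values, dtype=float)
        if values.shape != (sl.stop - sl.start,):
            raise ValueError(f"{name}: expected {sl.stop - sl.start} values")
        vec[sl] = values
        mask[sl] = 1.0
    return vec, mask


def from_unified(vec, names):
    """Read back only the quantities a given robot actually supports."""
    return {name: vec[SLOTS[name]].copy() for name in names}


# A single-arm robot fills only the right-arm slots; the rest stay padded.
single_arm = {
    "right_arm_joint_pos": np.arange(6.0),
    "right_gripper_open": np.array([1.0]),
}
vec, mask = to_unified(single_arm)
print(vec)
print(mask)
```

Sharing indices this way gives different embodiments a common input and output format, but it does not make their dynamics identical, which is the distinction the note draws.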

Evidence And Limitations

The source reports pretraining on a large multi-robot collection and fine-tuning on more than 6K bimanual trajectories, with real-robot improvements over ACT, OpenVLA, and Octo baselines. The scope remains bimanual manipulation; it is not a general future-observation world model or a full whole-body humanoid controller.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Multi-modal future distributions | partially closes | Uses diffusion to model multimodal continuous bimanual action chunks instead of deterministic action regression. | Future observations and state distributions are not rolled out. |
| Causal structure, counterfactuals, and control | adjacent | Produces language-conditioned bimanual control inputs from vision, proprioception, action chunks, and control frequency, which is an analogy for digital-world robot actuation. | It remains a physical robot policy, not a general action-conditioned time-series world model. |
| Context interface | adjacent | Treats language, images, proprioception, and control frequency as conditioning inputs. | No general system/channel context or intervention schema for numeric TSFMs. |
| Native numeric/action encoding | adjacent | Introduces a physically interpretable unified action space for heterogeneous robot quantities. | The action space is robot-specific and not a general numeric-token interface. |

Open Questions

  • How reusable is the physically interpretable unified action space beyond gripper-arm embodiments?
  • Do diffusion action chunks scale better than action tokens as bimanual action dimensionality rises?