Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Source

Core Claim

Open X-Embodiment consolidates many robot-learning datasets into a standardized multi-embodiment repository and shows that RT-X policies can transfer skills across robot platforms.

Sensor-Time-Series Notes

  • The dataset is a large collection of real robot trajectories rather than a passive forecasting benchmark.
  • The relevant time-series unit is a trajectory with image observations, language instructions, and control inputs.
  • The repository uses RLDS to accommodate different action spaces and sensor modalities across robots.
  • The RT-X experiments coarsely align observations and actions by selecting a canonical camera view, resizing images, and mapping controls into a 7-DoF end-effector action representation before discretization.
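The alignment recipe in the last bullet can be sketched as follows. This is a minimal illustration, not the actual RT-X pipeline: the dict keys (`world_vector`, `rotation_delta`, `gripper_closedness`), the 224×224 target resolution, the per-dimension bounds, and the 256-bin discretization are assumptions for the sketch.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count per action dimension

def discretize(action, low, high, n_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin index."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)  # normalize to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def align_step(step):
    """Coarsely align one trajectory step: canonical camera view,
    resized image, and a discretized 7-DoF end-effector action."""
    # Canonical camera view (observation/action key names are hypothetical).
    image = step["observation"]["image"]
    # Nearest-neighbor resize to a common resolution, done with plain
    # integer indexing to keep the sketch dependency-free.
    h, w = image.shape[:2]
    rows = np.arange(224) * h // 224
    cols = np.arange(224) * w // 224
    image = image[rows][:, cols]
    # 7-DoF action: xyz translation delta, rpy rotation delta, gripper.
    act = np.concatenate([
        step["action"]["world_vector"],         # 3: translation delta
        step["action"]["rotation_delta"],       # 3: rotation delta
        [step["action"]["gripper_closedness"]], # 1: gripper command
    ])
    low = np.array([-0.1] * 3 + [-0.5] * 3 + [0.0])   # assumed bounds
    high = np.array([0.1] * 3 + [0.5] * 3 + [1.0])
    return image, discretize(act, low, high)
```

The point of the sketch is that cross-embodiment training only requires a shared, coarse interface: one camera stream, one image size, one action layout, with embodiment-specific details absorbed by the per-robot bounds.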

Model Notes

RT-1-X and RT-2-X exemplify two common interfaces for robotics foundation models. RT-1-X feeds a short image history plus a language instruction into a Transformer policy that emits discretized actions. RT-2-X casts robot actions as language-token-like outputs so a vision-language model can be co-fine-tuned on web and robot data for control.
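The RT-2-X action-as-text idea can be illustrated with a minimal round-trip sketch. Assumptions: 256 bins per dimension and space-separated integers as the "token" format; the actual model emits tokens from the VLM's own tokenizer and vocabulary, which this sketch does not reproduce.

```python
import numpy as np

N_BINS = 256  # assumed bin count per action dimension

def action_to_text(bins):
    """Serialize discretized action bins as a plain string, so a
    vision-language model can emit them the way it emits words."""
    return " ".join(str(int(b)) for b in bins)

def text_to_action(text, low, high, n_bins=N_BINS):
    """Decode the string back to continuous values at the bin centers."""
    bins = np.array([int(tok) for tok in text.split()])
    return low + (bins + 0.5) / n_bins * (high - low)
```

Because decoding lands on bin centers, re-discretizing a decoded action recovers the original bins exactly, which is what makes the text channel a lossless carrier for the discretized control signal.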

Open Questions

  • Which parts of the RT-X alignment recipe are necessary for cross-embodiment transfer, and which are artifacts of the available datasets?
  • How should multi-view observations, proprioception, force, tactile, and control-frequency metadata be standardized without erasing embodiment-specific dynamics?