---
abstract: |
  Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a \`\`generalist'' `\cro`{=latex}-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective `\cro`{=latex}-robot policies. We assemble a dataset from $\numembodiment$ different robots collected through a collaboration between $\numinstitution$ institutions, demonstrating $\numskill$ skills ($\numtask$ tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is `\repourl`{=latex}.
author:
- |
  **O**pen **X**-**E**mbodiment Collaboration`\vspace{0.01em}`{=latex}$^{0}$\
  `\repourl `{=latex}`\vspace{0.03em}`{=latex}\
  `\footnotesize `{=latex}Abby O'Neill$^{34}$, Abdul Rehman$^{37}$, Abhinav Gupta$^{4}$, Abhiram Maddukuri$^{45}$, Abhishek Gupta$^{46}$, Abhishek Padalkar$^{10}$, Abraham Lee$^{34}$,\
  Acorn Pooley$^{11}$, Agrim Gupta$^{28}$, Ajay Mandlekar$^{22}$, Ajinkya Jain$^{15}$, Albert Tung$^{28}$, Alex Bewley$^{11}$, Alex Herzog$^{11}$, Alex Irpan$^{11}$,\
  Alexander Khazatsky$^{28}$, Anant Rai$^{23}$, Anchit Gupta$^{19}$, Andrew Wang$^{34}$, Andrey Kolobov$^{20}$, Anikait Singh$^{11, 34}$, Animesh Garg$^{9}$,\
  Aniruddha Kembhavi$^{1}$, Annie Xie$^{28}$, Anthony Brohan$^{11}$, Antonin Raffin$^{10}$, Archit Sharma$^{28}$, Arefeh Yavary$^{35}$, Arhan Jain$^{46}$, Ashwin Balakrishna$^{32}$,\
  Ayzaan Wahid$^{11}$, Ben Burgess-Limerick$^{25}$, Beomjoon Kim$^{17}$, Bernhard Schölkopf$^{18}$, Blake Wulfe$^{32}$, Brian Ichter$^{11}$, Cewu Lu$^{27, 8}$, Charles Xu$^{34}$,\
  Charlotte Le$^{34}$, Chelsea Finn$^{11, 28}$, Chen Wang$^{28}$, Chenfeng Xu$^{34}$, Cheng Chi$^{5, 28}$, Chenguang Huang$^{38}$, Christine Chan$^{11}$,\
  Christopher Agia$^{28}$, Chuer Pan$^{28}$, Chuyuan Fu$^{11}$, Coline Devin$^{11}$, Danfei Xu$^{9}$, Daniel Morton$^{28}$, Danny Driess$^{11}$, Daphne Chen$^{46}$, Deepak Pathak$^{4}$,\
  Dhruv Shah$^{34}$, Dieter Büchler$^{18}$, Dinesh Jayaraman$^{42}$, Dmitry Kalashnikov$^{11}$, Dorsa Sadigh$^{11}$, Edward Johns$^{14}$, Ethan Foster$^{28}$,\
  Fangchen Liu$^{34}$, Federico Ceola$^{16}$, Fei Xia$^{11}$, Feiyu Zhao$^{13}$, Felipe Vieira Frujeri$^{20}$, Freek Stulp$^{10}$, Gaoyue Zhou$^{23}$, Gaurav S. Sukhatme$^{43}$,\
  Gautam Salhotra$^{43, 15}$, Ge Yan$^{36}$, Gilbert Feng$^{34}$, Giulio Schiavi$^{7}$, Glen Berseth$^{41, 21}$, Gregory Kahn$^{34}$, Guangwen Yang$^{33}$,\
  Guanzhi Wang$^{3, 22}$, Hao Su$^{36}$, Hao-Shu Fang$^{27}$, Haochen Shi$^{28}$, Henghui Bao$^{43}$, Heni Ben Amor$^{2}$, Henrik I Christensen$^{36}$, Hiroki Furuta$^{31}$,\
  Homanga Bharadhwaj$^{4, 19}$, Homer Walke$^{34}$, Hongjie Fang$^{27}$, Huy Ha$^{5, 28}$, Igor Mordatch$^{11}$, Ilija Radosavovic$^{34}$, Isabel Leal$^{11}$,\
  Jacky Liang$^{11}$, Jad Abou-Chakra$^{25}$, Jaehyung Kim$^{17}$, Jaimyn Drake$^{34}$, Jan Peters$^{29}$, Jan Schneider$^{18}$, Jasmine Hsu$^{11}$, Jay Vakil$^{19}$,\
  Jeannette Bohg$^{28}$, Jeffrey Bingham$^{11}$, Jeffrey Wu$^{34}$, Jensen Gao$^{28}$, Jiaheng Hu$^{30}$, Jiajun Wu$^{28}$, Jialin Wu$^{12}$, Jiankai Sun$^{28}$, Jianlan Luo$^{34}$,\
  Jiayuan Gu$^{36}$, Jie Tan$^{11}$, Jihoon Oh$^{31}$, Jimmy Wu$^{24}$, Jingpei Lu$^{36}$, Jingyun Yang$^{28}$, Jitendra Malik$^{34}$, João Silvério$^{10}$, Joey Hejna$^{28}$,\
  Jonathan Booher$^{28}$, Jonathan Tompson$^{11}$, Jonathan Yang$^{28}$, Jordi Salvador$^{1}$, Joseph J. Lim$^{17}$, Junhyek Han$^{17}$, Kaiyuan Wang$^{36}$,\
  Kanishka Rao$^{11}$, Karl Pertsch$^{34, 28}$, Karol Hausman$^{11}$, Keegan Go$^{15}$, Keerthana Gopalakrishnan$^{11}$, Ken Goldberg$^{34}$, Kendra Byrne$^{11}$,\
  Kenneth Oslund$^{11}$, Kento Kawaharazuka$^{31}$, Kevin Black$^{34}$, Kevin Lin$^{28}$, Kevin Zhang$^{4}$, Kiana Ehsani$^{1}$, Kiran Lekkala$^{43}$, Kirsty Ellis$^{41}$,\
  Krishan Rana$^{25}$, Krishnan Srinivasan$^{28}$, Kuan Fang$^{34}$, Kunal Pratap Singh$^{6}$, Kuo-Hao Zeng$^{1}$, Kyle Hatch$^{32}$, Kyle Hsu$^{28}$, Laurent Itti$^{43}$,\
  Lawrence Yunliang Chen$^{34}$, Lerrel Pinto$^{23}$, Li Fei-Fei$^{28}$, Liam Tan$^{34}$, Linxi \"Jim\" Fan$^{22}$, Lionel Ott$^{7}$, Lisa Lee$^{11}$, Luca Weihs$^{1}$,\
  Magnum Chen$^{13}$, Marion Lepert$^{28}$, Marius Memmel$^{46}$, Masayoshi Tomizuka$^{34}$, Masha Itkina$^{32}$, Mateo Guaman Castro$^{46}$, Max Spero$^{28}$, Maximilian Du$^{28}$,\
  Michael Ahn$^{11}$, Michael C. Yip$^{36}$, Mingtong Zhang$^{39}$, Mingyu Ding$^{34}$, Minho Heo$^{17}$, Mohan Kumar Srirama$^{4}$, Mohit Sharma$^{4}$,\
  Moo Jin Kim$^{28}$, Muhammad Zubair Irshad$^{32}$, Naoaki Kanazawa$^{31}$, Nicklas Hansen$^{36}$, Nicolas Heess$^{11}$, Nikhil J Joshi$^{11}$, Niko Suenderhauf$^{25}$,\
  Ning Liu$^{13}$, Norman Di Palo$^{14}$, Nur Muhammad Mahi Shafiullah$^{23}$, Oier Mees$^{38}$, Oliver Kroemer$^{4}$, Osbert Bastani$^{42}$, Pannag R Sanketi$^{11}$,\
  Patrick \"Tree\" Miller$^{32}$, Patrick Yin$^{46}$, Paul Wohlhart$^{11}$, Peng Xu$^{11}$, Peter David Fagan$^{37}$, Peter Mitrano$^{40}$, Pierre Sermanet$^{11}$, Pieter Abbeel$^{34}$,\
  Priya Sundaresan$^{28}$, Qiuyu Chen$^{46}$, Quan Vuong$^{11}$, Rafael Rafailov$^{11, 28}$, Ran Tian$^{34}$, Ria Doshi$^{34}$, Roberto Martín-Martín$^{30}$,\
  Rohan Baijal$^{46}$, Rosario Scalise$^{46}$, Rose Hendrix$^{1}$, Roy Lin$^{34}$, Runjia Qian$^{13}$, Ruohan Zhang$^{28}$, Russell Mendonca$^{4}$, Rutav Shah$^{30}$,\
  Ryan Hoque$^{34}$, Ryan Julian$^{11}$, Samuel Bustamante$^{10}$, Sean Kirmani$^{11}$, Sergey Levine$^{11, 34}$, Shan Lin$^{36}$, Sherry Moore$^{11}$, Shikhar Bahl$^{4}$,\
  Shivin Dass$^{43, 30}$, Shubham Sonawani$^{2}$, Shubham Tulsiani$^{4}$, Shuran Song$^{5}$, Sichun Xu$^{11}$, Siddhant Haldar$^{23}$, Siddharth Karamcheti$^{28}$,\
  Simeon Adebola$^{34}$, Simon Guist$^{18}$, Soroush Nasiriany$^{30}$, Stefan Schaal$^{15}$, Stefan Welker$^{11}$, Stephen Tian$^{28}$, Subramanian Ramamoorthy$^{37}$,\
  Sudeep Dasari$^{4}$, Suneel Belkhale$^{28}$, Sungjae Park$^{17}$, Suraj Nair$^{32}$, Suvir Mirchandani$^{28}$, Takayuki Osa$^{31}$, Tanmay Gupta$^{1}$, Tatsuya Harada$^{31, 26}$,\
  Tatsuya Matsushima$^{31}$, Ted Xiao$^{11}$, Thomas Kollar$^{32}$, Tianhe Yu$^{11}$, Tianli Ding$^{11}$, Todor Davchev$^{11}$, Tony Z. Zhao$^{28}$,\
  Travis Armstrong$^{11}$, Trevor Darrell$^{34}$, Trinity Chung$^{34}$, Vidhi Jain$^{11, 4}$, Vikash Kumar$^{4}$, Vincent Vanhoucke$^{11}$, Vitor Guizilini$^{32}$, Wei Zhan$^{34}$,\
  Wenxuan Zhou$^{11, 4}$, Wolfram Burgard$^{44}$, Xi Chen$^{11}$, Xiangyu Chen$^{13}$, Xiaolong Wang$^{36}$, Xinghao Zhu$^{34}$, Xinyang Geng$^{34}$, Xiyuan Liu$^{13}$,\
  Xu Liangwei$^{13}$, Xuanlin Li$^{36}$, Yansong Pang$^{11}$, Yao Lu$^{11}$, Yecheng Jason Ma$^{42}$, Yejin Kim$^{1}$, Yevgen Chebotar$^{11}$, Yifan Zhou$^{2}$,\
  Yifeng Zhu$^{30}$, Yilin Wu$^{4}$, Ying Xu$^{11}$, Yixuan Wang$^{39}$, Yonatan Bisk$^{4}$, Yongqiang Dou$^{33}$, Yoonyoung Cho$^{17}$, Youngwoon Lee$^{34}$, Yuchen Cui$^{28}$,\
  Yue Cao$^{13}$, Yueh-Hua Wu$^{36}$, Yujin Tang$^{11, 31}$, Yuke Zhu$^{30}$, Yunchu Zhang$^{46}$, Yunfan Jiang$^{28}$, Yunshuang Li$^{42}$, Yunzhu Li$^{39}$,\
  Yusuke Iwasawa$^{31}$, Yutaka Matsuo$^{31}$, Zehan Ma$^{34}$, Zhuo Xu$^{11}$, Zichen Jeff Cui$^{23}$, Zichen Zhang$^{1}$, Zipeng Fu$^{28}$, Zipeng Lin$^{34}$`\vspace{-0.6em}`{=latex}
bibliography:
- references.bib
title: '`\LARGE `{=latex} **Open X-Embodiment: Robotic Learning Datasets and RT-X Models `\vspace{-1.8em}`{=latex}** '
---

```{=latex}
\newcommand{\todo}[1]{\textcolor{red}{TODO: #1}}
```
```{=latex}
\let\NAT@parse\undefined
```
```{=latex}
\newcommand{\repoacronym}{OXE\xspace}
```
```{=latex}
\newcommand{\repo}{Open X-Embodiment\xspace}
```
```{=latex}
\newcommand{\repourl}{\href{https://robotics-transformer-x.github.io}{robotics-transformer-x.github.io}}
```
```{=latex}
\newcommand{\numembodiment}{22}
```
```{=latex}
\newcommand{\numinstitution}{21}
```
```{=latex}
\newcommand{\numresearchlab}{34}
```
```{=latex}
\newcommand{\numdatasets}{60}
```
```{=latex}
\newcommand{\numskill}{527}
```
```{=latex}
\newcommand{\numtask}{160266}
```
```{=latex}
\newcommand{\edrbot}{Google Robot\xspace}
```
```{=latex}
\newcommand{\cro}{X}
```
```{=latex}
\newcommand{\numembodimentusedintraining}{9}
```
```{=latex}
\newcommand{\nummanipulatorusedintraining}{9}
```
```{=latex}
\makeatletter
```
```{=latex}
\let\@oldmaketitle\@maketitle
```
```{=latex}
\renewcommand{\@maketitle}{\@oldmaketitle%
\vspace{0.15em}
  \includegraphics[width=\linewidth]
    {figures/data_vis_v3.jpg}
    \captionof{figure}{\small We propose an open, large-scale dataset for robot learning curated from $\numinstitution$ institutions across the globe. The dataset represents diverse behaviors, robot embodiments and environments, and enables learning generalized robotic policies.}
    \vspace{-1.35em}
    \label{fig:teaser}
    }
```
```{=latex}
\makeatother
```
```{=latex}
\maketitle
```
```{=latex}
\pagestyle{empty}
```
```{=latex}
\addtocounter{figure}{-1}
```
```{=latex}
\vspace{-0.4em}
```
```{=latex}
\vspace{-0.1cm}
```
Introduction
============

```{=latex}
\vspace{-0.1cm}
```
```{=latex}
\footnotetext{\scriptsize $^{1}$Allen Institute for AI; $^{2}$Arizona State University; $^{3}$California Institute of Technology; $^{4}$Carnegie Mellon University; $^{5}$Columbia University; $^{6}$EPFL; $^{7}$ETH Zürich; $^{8}$Flexiv Robotics; $^{9}$Georgia Institute of Technology; $^{10}$German Aerospace Center; $^{11}$Google DeepMind; $^{12}$Google Research; $^{13}$IO-AI TECH; $^{14}$Imperial College London; $^{15}$Intrinsic LLC; $^{16}$Istituto Italiano di Tecnologia; $^{17}$Korea Advanced Institute of Science \& Technology; $^{18}$Max Planck Institute; $^{19}$Meta AI; $^{20}$Microsoft Research; $^{21}$Mila Quebec; $^{22}$NVIDIA; $^{23}$New York University; $^{24}$Princeton University; $^{25}$Queensland University of Technology; $^{26}$RIKEN; $^{27}$Shanghai Jiao Tong University; $^{28}$Stanford University; $^{29}$Technische Universität Darmstadt; $^{30}$The University of Texas at Austin; $^{31}$The University of Tokyo; $^{32}$Toyota Research Institute; $^{33}$Tsinghua University; $^{34}$University of California, Berkeley; $^{35}$University of California, Davis; $^{36}$University of California, San Diego; $^{37}$University of Edinburgh; $^{38}$University of Freiburg; $^{39}$University of Illinois Urbana-Champaign; $^{40}$University of Michigan; $^{41}$University of Montreal; $^{42}$University of Pennsylvania; $^{43}$University of Southern California; $^{44}$University of Technology, Nuremberg; $^{45}$University of Texas at Austin; $^{46}$University of Washington
}
```
A central lesson from advances in machine learning and artificial intelligence is that large-scale learning from diverse datasets can enable capable AI systems by producing general-purpose pretrained models. In fact, large-scale general-purpose models trained on large and diverse datasets can often outperform their *narrowly targeted* counterparts trained on smaller but more task-specific data. For instance, open-vocabulary classifiers (e.g., CLIP [@radford2021learning]) trained on large datasets scraped from the web tend to outperform fixed-vocabulary models trained on more limited datasets, and large language models [@openai2023gpt4; @anil2023palm] trained on massive text corpora tend to outperform systems that are only trained on narrow task-specific datasets. Increasingly, the most effective way to tackle a given narrow task (e.g., in vision or NLP) is to adapt a general-purpose model. However, these lessons are difficult to apply in robotics: any single robotic domain might be too narrow, and while computer vision and NLP can leverage large datasets sourced from the web, comparably large and broad datasets for robotic interaction are hard to come by. Even the largest data collection efforts still end up with datasets that are a fraction of the size and diversity of benchmark datasets in vision (5-18M) [@Weyand_2020_CVPR; @tencent-ml-images-2019] and NLP (1.5B-4.5B) [@lehmann-2017; @muhleisen2012web]. More importantly, such datasets are often still narrow along some axes of variation, either focusing on a single environment, a single set of objects, or a narrow range of tasks. How can we overcome these challenges in robotics and move the field of robotic learning toward the large-data regime that has been so successful in other domains?

Inspired by the generalization made possible by pretraining large vision or language models on diverse data, we take the perspective that the goal of training generalizable robot policies requires **`\cro`{=latex}-embodiment training**, i.e., training with data from multiple robotic platforms. While each individual robotic learning dataset might be too narrow, the union of all such datasets provides better coverage of variations in environments and robots. Learning generalizable robot policies requires developing methods that can utilize `\cro`{=latex}-embodiment data, tapping into datasets from many labs, robots, and settings. Even if such datasets in their current size and coverage are insufficient to attain the impressive generalization results demonstrated by large language models, the union of such data can potentially provide this kind of coverage in the future. Because of this, **we believe that enabling research into `\cro`{=latex}-embodiment robotic learning is critical at the present juncture**.

Following this rationale, we have two goals: **(1)** Evaluate whether policies trained on data from many different robots and environments enjoy the benefits of positive transfer, attaining better performance than policies trained only on data from each evaluation setup. **(2)** Organize large robotic datasets to enable future research on `\cro`{=latex}-embodiment models.

We focus our work on robotic manipulation. Addressing goal **(1)**, our empirical contribution is to demonstrate that several recent robotic learning methods, with minimal modification, can utilize `\cro`{=latex}-embodiment data and enable positive transfer. Specifically, we train the RT-1 [@brohan2023rt1] and RT-2 [@brohan2023rt2] models on data from $\nummanipulatorusedintraining$ different robotic manipulators. We show that the resulting models, which we call RT-X, can improve over policies trained only on data from the evaluation domain, exhibiting better generalization and new capabilities. Addressing **(2)**, we provide the `\repo `{=latex}(`\repoacronym`{=latex}) Repository, which includes a dataset with $\numembodiment$ different robotic embodiments from $\numinstitution$ different institutions, to enable the robotics community to pursue further research on `\cro`{=latex}-embodiment models, along with open-source tools to facilitate such research. Our aim is not to innovate in terms of particular architectures and algorithms, but rather to provide the models that we trained, together with data and tools, to energize research around `\cro`{=latex}-embodiment robotic learning.

```{=latex}
\begin{figure*}
    \centering
    \includegraphics[width=0.86\linewidth]
    {figures/rtx_data_analysis.pdf}
    \vspace{-0.25em}
    \caption{\small The \repo Dataset. \textbf{(a)}: the dataset consists of \numdatasets~individual datasets across $\numembodiment$ embodiments. \textbf{(b)}: the Franka robot has the largest diversity of visually distinct scenes due to the large number of Franka datasets. \textbf{(c)}: xArm and Google Robot contribute the largest number of trajectories due to a few large datasets. \textbf{(d, e)}: the dataset contains a great diversity of skills and common objects. 
    \label{fig:data_vis}}
    \vspace{-1.8em}
\end{figure*}
```
```{=latex}
\vspace{-0.5em}
```
Related Work
============

```{=latex}
\vspace{-0.15em}
```
`\noindent `{=latex}**Transfer across embodiments.** A number of prior works have studied methods for transfer across robot embodiments in simulation [@devin2017learning; @chen2018hardware; @sanchez2018graph; @pathak2019learning; @martin2019iros; @huang2020smp; @kurin2020my; @zakka2021xirl; @ghadirzadeh2021bayesian; @gupta2021metamorph; @schubert2023generalist; @shah2023gnm; @Zhou2023Modularity] and on real robots [@dasari2019robonet; @hu2022know; @bousmalis2023robocat; @yang2023polybot; @reed2022a; @salhotra2023bridging; @radosavovic2023robot]. These methods often introduce mechanisms specifically designed to address the embodiment gap between different robots, such as shared action representations [@martin2019iros; @shao2020unigrasp], incorporating representation learning objectives [@zakka2021xirl; @yang2023polybot], adapting the learned policy on embodiment information [@shao2020unigrasp; @xu2021adagrasp; @chen2018hardware; @ghadirzadeh2021bayesian; @huang2020smp], and decoupling robot and environment representations [@hu2022know]. Prior work has provided initial demonstrations of `\cro`{=latex}-embodiment training [@reed2022a] and transfer [@bousmalis2023robocat; @shah2023vint; @radosavovic2023robot] with transformer models. We investigate complementary architectures and provide complementary analyses, and, in particular, study the interaction between `\cro`{=latex}-embodiment transfer and web-scale pretraining. Similarly, methods for transfer across human and robot embodiments often employ techniques for reducing the embodiment gap, e.g., by translating between domains or learning transferable representations [@liu2018imitation; @yu2018one; @sharma2019third; @smith2019avid; @bonardi2020learning; @schmeckpeper2021reinforcement; @xiong2021learning; @jang2022bc; @bahl2022whirl; @ding2023embodied; @Bahl_2023_CVPR]. 
Alternatively, some works focus on sub-aspects of the problem such as learning transferable reward functions [@sermanet2016unsupervised; @zakka2021xirl; @shao2020concept; @chen2021learning; @kumar2023graph; @alakuijala2023learning], goals [@zhou2021manipulator; @wang2023mimicplay], dynamics models [@schmeckpeper2020learning], or visual representations [@nair2022r3m; @xiao2022masked; @radosavovic2022real; @ma2022vip; @majumdar2023we; @karamcheti2023language; @mu2023ec2; @bahl2023affordances] from human video data. Unlike most of these prior works, we directly train a policy on `\cro`{=latex}-embodiment data, without any mechanisms to reduce the embodiment gap, and observe positive transfer by leveraging that data.

`\noindent `{=latex}**Large-scale robot learning datasets.** The robot learning community has created open-source robot learning datasets, spanning grasping [@jiang2011efficient; @Pinto2015SupersizingSL; @bohg2015dataset; @Mahler2017DexNet2D; @depierre2018jacquard; @levine2018learning; @kalashnikov2018qt; @Brahmbhatt2019; @fang2020graspnet; @acronym2020; @bousmalis2018grasping; @zhu2023fanuc], pushing interactions [@yu2016more; @finn2017deep; @ebert2018visual; @dasari2019robonet], sets of objects and models [@shilane_princeton_2004; @wohlkinger_3dnet_2012; @kit2012; @singh_bigbird_2014; @Calli2015YCB; @zhirong_wu_3d_2015; @xiang_objectnet3d_2016; @morrison2020egad; @gao2021objectfolder; @downs2022google; @kalashnikov2021mt], and teleoperated demonstrations [@DBLP:journals/corr/abs-1811-02790; @sharma2018multiple; @mandlekar2019scaling; @ebert2021bridge; @robomimic2021; @brohan2023rt1; @lynch2023interactive; @fang2023rh20t; @roboagent; @heo2023furniturebench; @walke2023bridgedata]. With the exception of RoboNet [@dasari2019robonet], each of these datasets contains data from robots of a single type, whereas we focus on data spanning multiple embodiments. The goal of our data repository is complementary to these efforts: we process and aggregate a large number of prior datasets into a single, standardized repository, called `\repo`{=latex}, which shows how robot learning datasets can be shared in a meaningful and useful way.

`\noindent `{=latex}**Language-conditioned robot learning.** Prior work has aimed to endow robots and other agents with the ability to understand and follow language instructions [@WINOGRAD19721; @macmahon2006; @kollar2010; @chen2011; @duvallet2015; @Luketina2019ASO], often by learning language-conditioned policies [@shao2020concept; @stepputtis2020language; @nair2022learning; @mees2022calvin; @mees2022matters; @jang2022bc; @shridhar2022perceiver; @brohan2023rt1]. We train language-conditioned policies via imitation learning like many of these prior works but do so using large-scale multi-embodiment demonstration data. Following previous works that leverage pre-trained language embeddings [@hill2020human; @shao2020concept; @lynch2021grounding; @nair2022learning; @jang2022bc; @ahn2022saycan; @jiang2022vima; @brohan2023rt1; @vemprala2023chatgpt; @huang2023voxposer] and pre-trained vision-language models [@shridhar2022cliport; @stone2023moo; @mu2023embodiedgpt; @brohan2023rt2] in robotic imitation learning, we study both forms of pre-training in our experiments, specifically following the recipes of RT-1 [@brohan2023rt1] and RT-2 [@brohan2023rt2].

```{=latex}
\begin{figure*}[h]\centering
    \includegraphics[width=0.88\textwidth]{figures/rt12_teaser_model_vertical.pdf}
    \vspace{-0.1em}
    \caption{
    \small RT-1-X and RT-2-X both take images and a text instruction as input and output discretized end-effector actions. 
    RT-1-X is an architecture designed for robotics, with a FiLM~\cite{perez2017film} conditioned EfficientNet~\cite{tan2019efficientnet} and a Transformer~\cite{vaswani2017attention}.
    RT-2-X builds on a VLM backbone by representing actions as another language, and training action text tokens together with vision-language data.
    }
\label{fig:rt12}
\vspace{-1.8em}
\end{figure*}
```
```{=latex}
\vspace{-0.3em}
```
The `\repo `{=latex}Repository
==============================

```{=latex}
\vspace{-0.02em}
```
We introduce the `\repo `{=latex}Repository (`\repourl`{=latex}) -- an open-source repository which includes **large-scale data** along with **pre-trained model checkpoints** for `\cro`{=latex}-embodied robot learning research. More specifically, we provide and maintain the following open-source resources to the broader community:

-   **`\repo `{=latex}Dataset**: a robot learning dataset with *1M+ robot trajectories* from *$\numembodiment$ robot embodiments*.

-   **Pre-Trained Checkpoints**: a selection of RT-X model checkpoints ready for inference and finetuning.

```{=latex}
\vspace{-0.2em}
```
We intend for these resources to form a foundation for `\cro`{=latex}-embodiment research in robot learning, but they are just the start. `\repo `{=latex}is a community-driven effort, currently involving $\numinstitution$ institutions from around the world, and we hope to further broaden participation and grow the initial `\repo `{=latex}Dataset over time. In this section, we summarize the dataset and `\cro`{=latex}-embodiment learning framework, before discussing the specific models we use to evaluate our dataset and our experimental results.

```{=latex}
\vspace{-0.3em}
```
The `\repo `{=latex}Dataset
---------------------------

```{=latex}
\vspace{-0.1em}
```
The `\repo `{=latex}Dataset contains 1M+ real robot trajectories spanning $\numembodiment$ robot embodiments, from single robot arms to bi-manual robots and quadrupeds. The dataset was constructed by pooling `\numdatasets`{=latex} *existing* robot datasets from $\numresearchlab$ robotic research labs around the world and converting them into a consistent data format for easy download and usage. We use the [RLDS](https://github.com/google-research/rlds) data format [@ramos2021rlds], which saves data in serialized [`tfrecord`](https://www.tensorflow.org/tutorials/load_data/tfrecord) files and accommodates the various action spaces and input modalities of different robot setups, such as differing numbers of RGB cameras, depth cameras and point clouds. It also supports efficient, parallelized data loading in all major deep learning frameworks. For more details about the data storage format and a breakdown of all `\numdatasets`{=latex} datasets, see `\repourl`{=latex}.
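To make the episode layout concrete, the sketch below builds a tiny in-memory episode that mimics the RLDS structure (`steps` containing `observation`, `action`, and `is_terminal`) and iterates over its transitions. The exact keys inside each observation vary per dataset; the ones shown here are placeholders. In practice, the data is loaded from serialized `tfrecord` files via standard dataset tooling rather than constructed by hand.

```python
import numpy as np

# Minimal sketch of an RLDS-style episode. The top-level "steps" field and
# per-step "observation"/"action"/"is_terminal" follow RLDS conventions;
# the observation keys below are illustrative placeholders.
def make_dummy_episode(num_steps=3):
    """Build a tiny in-memory episode mimicking the RLDS layout."""
    return {
        "steps": [
            {
                "observation": {
                    "image": np.zeros((64, 64, 3), dtype=np.uint8),
                    "language_instruction": "pick up the sponge",
                },
                "action": np.zeros(7, dtype=np.float32),  # x, y, z, roll, pitch, yaw, gripper
                "is_terminal": t == num_steps - 1,
            }
            for t in range(num_steps)
        ]
    }

def iterate_transitions(episode):
    """Yield (observation, action) pairs, as a data loader would."""
    for step in episode["steps"]:
        yield step["observation"], step["action"]

episode = make_dummy_episode()
transitions = list(iterate_transitions(episode))
```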

```{=latex}
\vspace{-0.25em}
```
Dataset Analysis
----------------

```{=latex}
\vspace{-0.2em}
```
`\cref{fig:data_vis}`{=latex} analyzes the `\repo `{=latex}Dataset. `\cref{fig:data_vis}`{=latex}(a) shows the breakdown of datasets by robot embodiments, with the Franka robot being the most common. This is reflected in the number of distinct scenes (based on dataset metadata) per embodiment (`\cref{fig:data_vis}`{=latex}(b)), where Franka dominates. `\cref{fig:data_vis}`{=latex}(c) shows the breakdown of trajectories per embodiment. To further analyze the diversity, we use the language annotations present in our data. We use the PaLM language model [@anil2023palm] to extract objects and behaviors from the instructions. `\cref{fig:data_vis}`{=latex}(d,e) show the diversity of skills and objects. While most skills belong to the pick-place family, the long tail of the dataset contains skills like \`\`wiping" or \`\`assembling". Additionally, the data covers a range of household objects, from appliances to food items and utensils.

```{=latex}
\vspace{-0.2em}
```
RT-X Design
===========

```{=latex}
\vspace{-0.16em}
```
To evaluate how much `\cro`{=latex}-embodiment training can improve the performance of learned policies on individual robots, we require models that have sufficient capacity to productively make use of such large and heterogeneous datasets. To that end, our experiments will build on two recently proposed Transformer-based robotic policies: RT-1 [@brohan2023rt1] and RT-2 [@brohan2023rt2]. We briefly summarize the design of these models in this section, and discuss how we adapted them to the `\cro`{=latex}-embodiment setting in our experiments.

```{=latex}
\vspace{-0.3em}
```
Data format consolidation
-------------------------

```{=latex}
\vspace{-0.15em}
```
One challenge of creating `\cro`{=latex}-embodiment models is that observation and action spaces vary significantly across robots. We use a coarsely aligned action and observation space across datasets. The model receives a history of recent images and language instructions as observations and predicts a 7-dimensional action vector controlling the end-effector ($x$, $y$, $z$, roll, pitch, yaw, and gripper opening or the rates of these quantities). We select one canonical camera view from each dataset as the input image, resize it to a common resolution and convert the original action set into a 7 DoF end-effector action. We normalize each dataset's actions prior to discretization. This way, an output of the model can be interpreted (de-normalized) differently depending on the embodiment used. It should be noted that despite this coarse alignment, the camera observations still vary substantially across datasets, e.g. due to differing camera poses relative to the robot or differing camera properties, see Figure `\ref{fig:rt12}`{=latex}. Similarly, for the action space, we do not align the coordinate frames across datasets in which the end-effector is controlled, and allow action values to represent either absolute or relative positions or velocities, as per the original control scheme chosen for each robot. Thus, the same action vector may induce very different motions for different robots.
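The per-dataset normalization and embodiment-specific de-normalization described above can be sketched as follows. The paper does not specify the exact normalization scheme, so this sketch assumes a simple per-dimension min/max mapping to $[-1, 1]$; the two embodiments and their action ranges are synthetic examples.

```python
import numpy as np

# Sketch of per-dataset action normalization (assumed min/max scheme).
# The model operates in a shared normalized space; outputs are
# de-normalized with the statistics of whichever embodiment is in use.
def fit_norm_stats(actions):
    """Per-dimension min/max over one dataset's 7-DoF actions."""
    return actions.min(axis=0), actions.max(axis=0)

def normalize(action, stats):
    lo, hi = stats
    return 2.0 * (action - lo) / np.maximum(hi - lo, 1e-8) - 1.0

def denormalize(action, stats):
    """Invert normalization with the *target* embodiment's statistics."""
    lo, hi = stats
    return (action + 1.0) / 2.0 * (hi - lo) + lo

# Two embodiments with different action ranges share one model output space.
rng = np.random.default_rng(0)
franka_actions = rng.uniform(-0.05, 0.05, size=(100, 7))
widowx_actions = rng.uniform(-0.02, 0.02, size=(100, 7))
franka_stats = fit_norm_stats(franka_actions)
widowx_stats = fit_norm_stats(widowx_actions)
```

The same normalized model output, de-normalized with different statistics, thus yields different physical commands on different robots, which is one reason the same action vector can induce very different motions.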

```{=latex}
\vspace{-0.2em}
```
Policy architectures
--------------------

```{=latex}
\vspace{-0.1em}
```
We consider two model architectures in our experiments: (1) RT-1 [@brohan2023rt1], an efficient Transformer-based architecture designed for robotic control, and (2) RT-2 [@brohan2023rt2], a large vision-language model co-fine-tuned to output robot actions as natural language tokens. Both models take in a visual input and a natural language instruction describing the task, and output a tokenized action. For each model, the action is tokenized into 256 bins uniformly distributed along each of eight dimensions: one dimension for terminating the episode and seven dimensions for end-effector movement. Although both architectures are described in detail in their original papers [@brohan2023rt1; @brohan2023rt2], we provide a short summary of each below:
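The 256-bin uniform discretization can be sketched as below. The normalized range $[-1, 1]$ is an assumption (see the normalization discussion above); only the bin count follows the paper.

```python
import numpy as np

NUM_BINS = 256  # per the paper: 256 uniform bins per action dimension

def discretize(action, low=-1.0, high=1.0):
    """Map a continuous (already normalized) action vector to bin indices."""
    clipped = np.clip(action, low, high)
    bins = np.floor((clipped - low) / (high - low) * NUM_BINS).astype(int)
    return np.minimum(bins, NUM_BINS - 1)  # the value `high` falls in the last bin

def undiscretize(bins, low=-1.0, high=1.0):
    """Recover a continuous estimate as each bin's center."""
    return low + (bins + 0.5) / NUM_BINS * (high - low)

# A 7-DoF end-effector action; the full tokenization adds an eighth
# dimension for episode termination.
action = np.array([0.9, 0.1, -0.3, 0.5, 0.0, 0.0, 1.0])
tokens = discretize(action)
```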

**RT-1 [@brohan2023rt1]** is a 35M parameter network built on a Transformer architecture [@vaswani2017attention] and designed for robotic control, as shown in `\cref{fig:rt12}`{=latex}. It takes in a history of $15$ images along with the natural language instruction. Each image is processed through an ImageNet-pretrained EfficientNet [@tan2019efficientnet] and the natural language instruction is transformed into a USE [@cer2018universal] embedding. The visual and language representations are then interwoven via FiLM [@perez2017film] layers, producing 81 vision-language tokens. These tokens are fed into a decoder-only Transformer, which outputs the tokenized actions.

```{=latex}
\begin{figure*}[!htbp]\centering
    \includegraphics[width=0.9\linewidth]{figures/rtx_barplot_10.png}
    \caption{\small RT-1-X mean success rate is $50\%$ higher than that of either the Original Method or RT-1. RT-1 and RT-1-X have the same network architecture. Therefore the performance increase can be attributed to co-training on the robotics data mixture. 
    The lab logos indicate the physical location of real robot evaluation, and the robot pictures indicate the embodiment used for the evaluation.}
    \label{fig:rt_x_barplot}
    \vspace{-1.9em}
\end{figure*}
```
**RT-2 [@brohan2023rt2]** is a family of large vision-language-*action* models (VLAs) trained on Internet-scale vision and language data along with robotic control data. RT-2 casts the tokenized actions to text tokens, e.g., a possible action may be \`\`1 128 91 241 5 101 127''. As such, any pretrained vision-language model (VLM [@chen2023palix; @alayrac2022flamingo; @driess2023palme]) can be finetuned for robotic control, thus leveraging the backbone of VLMs and transferring some of their generalization properties. In this work, we focus on the RT-2-PaLI-X variant [@chen2023palix] built on a backbone of a visual model, ViT [@dosovitskiy2021image], and a language model, UL2 [@tay2023ul2], and pretrained primarily on the WebLI [@chen2023palix] dataset.

```{=latex}
\vspace{-0.1em}
```
Training and inference details
------------------------------

```{=latex}
\vspace{-0.1em}
```
`\label{sec:training_details}`{=latex}

Both models use a standard categorical cross-entropy objective over their output space (discrete buckets for RT-1 and all possible language tokens for RT-2).
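The objective amounts to an independent categorical cross-entropy per action dimension over the discretized bins. Below is an illustrative numpy sketch (not the authors' code); shapes and names are assumptions.

```python
import numpy as np

# Sketch of the per-dimension categorical cross-entropy over tokenized
# actions: logits over 256 bins for each of 8 action dimensions, with
# the ground-truth bin index as the target.
def cross_entropy(logits, target_bins):
    """logits: (dims, 256) array; target_bins: (dims,) int array."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_bins)), target_bins].mean()

rng = np.random.default_rng(0)
loss = cross_entropy(rng.normal(size=(8, 256)), rng.integers(0, 256, size=8))
```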

We define the robotics data mixture used across all of the experiments as the data from $\nummanipulatorusedintraining$ manipulators, taken from the RT-1 [@brohan2023rt1], QT-Opt [@kalashnikov2018qt], Bridge [@walke2023bridgedata], Task Agnostic Robot Play [@rosetebeas2022latent; @mees2023grounding], Jaco Play [@dass2023jacoplay], Cable Routing [@luo2023multistage], RoboTurk [@DBLP:journals/corr/abs-1811-02790], NYU VINN [@pari2021surprising], Austin VIOLA [@zhu2023viola], Berkeley Autolab UR5 [@BerkeleyUR5Website], TOTO [@zhou2023train] and Language Table [@lynch2023interactive] datasets. RT-1-X is trained only on the robotics data mixture defined above, whereas RT-2-X is trained via co-fine-tuning (similarly to the original RT-2 [@brohan2023rt2]), with an approximately one-to-one split of the original VLM data and the robotics data mixture. Note that the robotics data mixture used in our experiments includes $\numembodimentusedintraining$ embodiments, fewer than the entire `\repo `{=latex}dataset ($\numembodiment$) -- the practical reason for this difference is that we have continued to extend the dataset over time, and at the time of the experiments, the mixture above represented all of the available data. In the future, we plan to continue training policies on the extended versions of the dataset as well as continue to grow the dataset together with the robot learning community.

At inference time, each model is run at the rate required for the robot (3-10 Hz), with RT-1 run locally and RT-2 hosted on a cloud service and queried over the network.
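The control setup can be sketched as a fixed-rate loop that queries the policy each tick and sleeps off any leftover time. The 5 Hz rate (within the 3-10 Hz range above) and the `query_policy` stub standing in for a local model call or a cloud RPC are illustrative assumptions:

```python
import time

CONTROL_HZ = 5.0  # illustrative rate within the 3-10 Hz range

def query_policy(observation):
    """Placeholder: a local model forward pass (RT-1) or a
    network request to a cloud-hosted model (RT-2)."""
    return {"action": [0.0] * 7}

def control_loop(get_observation, send_action, n_steps):
    """Run the policy at a fixed rate, sleeping off leftover time."""
    period = 1.0 / CONTROL_HZ
    for _ in range(n_steps):
        start = time.monotonic()
        obs = get_observation()
        send_action(query_policy(obs)["action"])
        # Sleep for whatever remains of this control period.
        elapsed = time.monotonic() - start
        if elapsed < period:
            time.sleep(period - elapsed)
```

For the cloud-hosted case, the round-trip latency of `query_policy` must fit within the control period, which is why the achievable rate depends on where the model is served.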

```{=latex}
\vspace{1em}
```
```{=latex}
\centering
```
```{=latex}
\scriptsize
```
```{=latex}
\setlength{\tabcolsep}{3pt}
```
::: {#tab:result_on_dataset_with_more_than_5k_episodes}
  Evaluation Setting               Bridge                        Bridge              RT-1 paper 6 skills
  --------------------- ----------------------------- ----------------------------- ---------------------
  Evaluation Location          IRIS (Stanford)               RAIL Lab (UCB)          Google Robotic Lab
  Robot Embodiment                 WidowX                        WidowX              `\edrbot `{=latex}
  Original Method        LCBC [@walke2023bridgedata]   LCBC [@walke2023bridgedata]           \-
  Original Method                  $13\%$                        $13\%$                      \-
  RT-1                             $40\%$                    $\textbf{30\%}$           $\textbf{92\%}$
  RT-1-X                           $27\%$                        $27\%$                    $73\%$
  RT-2-X (55B)                 $\textbf{50\%}$               $\textbf{30\%}$           $\textbf{91\%}$

  :  `\small `{=latex}Parameter-count scaling experiment assessing the impact of model capacity on absorbing large-scale, diverse embodiment data. On these large-scale datasets (Bridge and RT-1 paper data), RT-1-X underfits and performs worse than the Original Method and RT-1. The RT-2-X model, with significantly more parameters, obtains strong performance in both evaluation scenarios.
:::

```{=latex}
\vspace{-0.15em}
```
```{=latex}
\vspace{-3em}
```
```{=latex}
\vspace{-0.2em}
```
Experimental Results
====================

```{=latex}
\vspace{-0.1em}
```
Our experiments answer three questions about the effect of `\cro`{=latex}-embodiment training: (1) Can policies trained on our `\cro`{=latex}-embodiment dataset exhibit positive transfer, such that co-training on data collected from multiple robots improves performance on the training tasks? (2) Does co-training models on data from multiple platforms and tasks improve generalization to new, unseen tasks? (3) What is the influence of different design dimensions, such as model size, model architecture, or dataset composition, on the performance and generalization capabilities of the resulting policy? To answer these questions, we conduct a total of 3600 evaluation trials across 6 different robots.

In-distribution performance across different embodiments
--------------------------------------------------------

To assess the ability of RT-X models to learn from `\cro`{=latex}-embodiment data, we evaluate performance on in-distribution tasks. We split our evaluation into two types: evaluation on domains that have small-scale datasets (`\cref{fig:rt_x_barplot}`{=latex}), where we would expect transfer from larger datasets to significantly improve performance, and evaluation on domains that have large-scale datasets (`\cref{tab:result_on_dataset_with_more_than_5k_episodes}`{=latex}), where we expect further improvement to be more challenging. Note that we use the same robotics data *training* mixture (defined in Sec. `\ref{sec:training_details}`{=latex}) for all the evaluations presented in this section. For small-scale dataset experiments, we use Kitchen Manipulation [@dass2023jacoplay], Cable Routing [@luo2023multistage], NYU Door Opening [@pari2021surprising], AUTOLab UR5 [@BerkeleyUR5Website], and Robot Play [@TacoPlayWebsite]. We use the same evaluation and robot embodiment as in the respective publications. For large-scale dataset experiments, we consider Bridge [@walke2023bridgedata] and RT-1 [@brohan2023rt1] for in-distribution evaluation and use their respective robots: WidowX and `\edrbot`{=latex}.

For each small dataset domain, we compare the performance of the RT-1-X model, and for each large dataset we consider both the RT-1-X and RT-2-X models. For all experiments, the models are co-trained on the full `\cro`{=latex}-embodiment dataset. Throughout this evaluation we compare with two baseline models: (1) The model developed by the creators of the dataset trained only on that respective dataset. This constitutes a reasonable baseline insofar as it can be expected that the model has been optimized to work well with the associated data; we refer to this baseline model as the *Original Method* model. (2) An RT-1 model trained on the dataset in isolation; this baseline allows us to assess whether the RT-X model architectures have enough capacity to represent policies for multiple different robot platforms simultaneously, and whether co-training on multi-embodiment data leads to higher performance.

**Small-scale dataset domains** (`\cref{fig:rt_x_barplot}`{=latex}). RT-1-X outperforms the Original Method trained on each robot-specific dataset on 4 of the 5 datasets, with a large average improvement, demonstrating that domains with limited data benefit substantially from co-training on `\cro`{=latex}-embodiment data.

```{=latex}
\begin{table*}
        \centering
        \scriptsize
        \setlength{\tabcolsep}{3pt}
        \begin{tabular}{@{}ccccccccc@{}}
        \toprule
        Row & Model & Size & History Length & Dataset & Co-Trained w/ Web & Initial Checkpoint & Emergent Skills Evaluation & RT-2 Generalization Evaluation \\
        \midrule
        (1) & RT-2 & 55B & none & \edrbot action & Yes & Web-pretrained & $27.3\%$ & $\textbf{62\%}$ \\
        (2) & RT-2-X & 55B & none & Robotics data & Yes & Web-pretrained & $\textbf{75.8\%}$ & $\textbf{61\%}$ \\
 (3) & RT-2-X & 55B & none & Robotics data except Bridge & Yes & Web-pretrained & $42.8\%$ & $54\%$ \\
 (4) & RT-2-X & 5B & 2 & Robotics data & Yes  & Web-pretrained & $44.4\%$ & $52\%$ \\
 (5) & RT-2-X & 5B & none & Robotics data & Yes & Web-pretrained & $14.5\%$ & $30\%$ \\
 (6) & RT-2-X & 5B & 2 & Robotics data & No  & From scratch & $0\%$ & $1\%$ \\
(7) & RT-2-X & 5B & 2 & Robotics data & No  & Web-pretrained & $48.7\%$ & $47\%$ \\ 
        \bottomrule
        \end{tabular}
        \vspace{-0.5em}
        \caption{\small Ablations to show the impact of design decisions on generalization (to unseen objects, backgrounds, and environments) and emergent skills (skills from other datasets on the \edrbot), showing the importance of Web-pretraining, model size, and history.} 
        \label{tab:rt_2_x_generalization_and_ablation_results}
        \vspace{-3em}
\end{table*}
```
**Large-scale dataset domains** (`\cref{tab:result_on_dataset_with_more_than_5k_episodes}`{=latex}). In the large-dataset setting, the RT-1-X model does not outperform the RT-1 baseline trained only on the embodiment-specific dataset, indicating underfitting for that model class. However, the larger RT-2-X model outperforms both the Original Method and RT-1, suggesting that `\cro`{=latex}-robot training can improve performance in data-rich domains, but only when a sufficiently high-capacity architecture is used.

```{=latex}
\vspace{-0.2em}
```
Improved generalization to out-of-distribution settings
-------------------------------------------------------

```{=latex}
\vspace{-0.1em}
```
We now examine how `\cro`{=latex}-embodiment training can enable better generalization to out-of-distribution settings and more complex and novel instructions. These experiments focus on the high-data domains, and use the RT-2-X model.

**Unseen objects, backgrounds and environments.** We first conduct the same evaluation of generalization properties as proposed in [@brohan2023rt2], testing for the ability to manipulate unseen objects in unseen environments and against unseen backgrounds. We find that RT-2 and RT-2-X perform roughly on par (`\cref{tab:rt_2_x_generalization_and_ablation_results}`{=latex}, rows (1) and (2), last column). This is not unexpected, since RT-2 already generalizes well (see [@brohan2023rt2]) along these dimensions due to its VLM backbone.

**Emergent skills evaluation.** To investigate the transfer of knowledge across robots, we conduct experiments with the `\edrbot`{=latex}, assessing the performance on tasks like the ones shown in `\cref{fig:emergent_skill}`{=latex}. These tasks involve objects and skills that are not present in the RT-2 dataset but occur in the Bridge dataset [@walke2023bridgedata] for a different robot (the *WidowX robot*). Results are shown in `\cref{tab:rt_2_x_generalization_and_ablation_results}`{=latex}, Emergent Skills Evaluation column. Comparing rows (1) and (2), we find that RT-2-X outperforms RT-2 by $\sim3\times$, suggesting that incorporating data from other robots into the training improves the range of tasks that can be performed even by a robot that already has large amounts of data available. Our results suggest that co-training with data from other platforms imbues the RT-2-X controller with additional skills for the platform that are not present in that platform's original dataset.

Our next ablation removes the Bridge dataset from RT-2-X training: Row (3) shows results for an RT-2-X model trained on all of the data used for RT-2-X except the Bridge dataset. This variation significantly reduces performance on the held-out tasks, suggesting that transfer from the *WidowX* data may indeed be responsible for the additional skills RT-2-X can perform with the `\edrbot`{=latex}.

```{=latex}
\vspace{-0.4em}
```
Design decisions
----------------

```{=latex}
\vspace{-0.25em}
```
Lastly, we perform ablations to measure the influence of different design decisions on the generalization capabilities of our most performant RT-2-X model, presented in `\cref{tab:rt_2_x_generalization_and_ablation_results}`{=latex}. We note that including a short history of images significantly improves generalization performance (row (4) vs. row (5)). Similarly to the conclusions of the RT-2 paper [@brohan2023rt2], Web-based pre-training of the model is critical for achieving high performance with the large models (row (4) vs. row (6)). We also note that the $55B$ model has a significantly higher success rate on the Emergent Skills evaluation compared to the $5B$ model (row (2) vs. row (4)), demonstrating that higher model capacity enables a higher degree of transfer across robotic datasets. Contrary to previous RT-2 findings, co-fine-tuning and fine-tuning yield similar performance in both the Emergent Skills and Generalization evaluations (row (4) vs. row (7)), which we attribute to the robotics data used in RT-2-X being much more diverse than the previously used robotics datasets.

```{=latex}
\vspace{-0.3em}
```
Discussion, Future Work, and Open Problems
==========================================

```{=latex}
\vspace{-0.3em}
```
`\label{sec:conclusions}`{=latex}

We presented a consolidated dataset that combines data from $\numembodiment$ robotic embodiments, collected through a collaboration between $\numinstitution$ institutions and demonstrating $\numskill$ skills ($\numtask$ tasks). We also presented an experimental demonstration that Transformer-based policies trained on this data can exhibit significant positive transfer between the different robots in the dataset. Our results showed that the RT-1-X policy has a $50\%$ higher success rate than the original, state-of-the-art methods contributed by the collaborating institutions, while the bigger vision-language-model-based version (RT-2-X) demonstrated $\sim3\times$ generalization improvements over a model trained only on data from the evaluation embodiment. In addition, we provided multiple resources for the robotics community to explore `\cro`{=latex}-embodiment robot learning research, including: the unified `\cro`{=latex}-robot and `\cro`{=latex}-institution dataset, sample code showing how to use the data, and the RT-1-X model to serve as a foundation for future exploration.

```{=latex}
\vspace{0.2em}
```
```{=latex}
\centering
```
![`\small `{=latex}To assess transfer *between* embodiments, we evaluate the RT-2-X model on out-of-distribution skills. These skills are in the Bridge dataset, but not in the `\edrbot `{=latex}dataset (the embodiment they are evaluated on). ](figures/rt2x_eval_v9.png "fig:"){#fig:emergent_skill width="40%"} `\vspace{-1.5em}`{=latex}

```{=latex}
\vspace{-2.3em}
```
While RT-X demonstrates a step towards a `\cro`{=latex}-embodied robot generalist, many more steps are needed to make this future a reality. Our experiments do not consider robots with very different sensing and actuation modalities, do not study generalization to entirely new robots, and do not provide a decision criterion for when positive transfer does or does not happen. Studying these questions is an important direction for future work. This work serves not only as evidence that `\cro`{=latex}-robot learning is feasible and practical, but also provides the tools to advance research in this direction in the future.

```{=latex}
\bibliographystyle{IEEEtran}
```
