RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
July 28, 2023
Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich
cs.AI
Abstract
We study how vision-language models trained on Internet-scale data can be
incorporated directly into end-to-end robotic control to boost generalization
and enable emergent semantic reasoning. Our goal is to enable a single
end-to-end trained model to both learn to map robot observations to actions and
enjoy the benefits of large-scale pretraining on language and vision-language
data from the web. To this end, we propose to co-fine-tune state-of-the-art
vision-language models on both robotic trajectory data and Internet-scale
vision-language tasks, such as visual question answering. In contrast to other
approaches, we propose a simple, general recipe to achieve this goal: in order
to fit both natural language responses and robotic actions into the same
format, we express the actions as text tokens and incorporate them directly
into the training set of the model in the same way as natural language tokens.
We refer to this category of models as vision-language-action models (VLA) and
instantiate an example of such a model, which we call RT-2. Our extensive
evaluation (6k evaluation trials) shows that our approach leads to performant
robotic policies and enables RT-2 to obtain a range of emergent capabilities
from Internet-scale training. This includes significantly improved
generalization to novel objects, the ability to interpret commands not present
in the robot training data (such as placing an object onto a particular number
or icon), and the ability to perform rudimentary reasoning in response to user
commands (such as picking up the smallest or largest object, or the one closest
to another object). We further show that incorporating chain of thought
reasoning allows RT-2 to perform multi-stage semantic reasoning, for example
figuring out which object to pick up for use as an improvised hammer (a rock),
or which type of drink is best suited for someone who is tired (an energy
drink).
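To make the "actions as text tokens" recipe concrete, the sketch below shows one way a continuous action vector could be discretized and rendered as a text string that shares the model's token space. The 256-bin discretization and the 8-dimensional action layout (terminate flag, position deltas, rotation deltas, gripper) follow the convention described for RT-1/RT-2, but the function names, bounds, and NumPy implementation here are illustrative assumptions, not the paper's reference code.

```python
# Illustrative sketch: encoding robot actions as text tokens so they can be
# trained and decoded like ordinary language targets. Bin count and action
# layout are assumptions based on the RT-1/RT-2 convention.
import numpy as np

NUM_BINS = 256  # each continuous action dimension is discretized into 256 bins


def action_to_text(action, low, high):
    """Map a continuous action vector to a space-separated string of integer bins."""
    clipped = np.clip(action, low, high)
    # Scale each dimension to [0, NUM_BINS - 1] and round to an integer bin.
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    # The resulting string is tokenized like any other text target, so the
    # vision-language model can emit actions with its ordinary language head.
    return " ".join(str(b) for b in bins)


def text_to_action(text, low, high):
    """Invert the mapping: decode predicted bin tokens back to a continuous action."""
    bins = np.array([int(t) for t in text.split()], dtype=np.float64)
    return low + bins / (NUM_BINS - 1) * (high - low)


# Example: an 8-D action (terminate, dx, dy, dz, droll, dpitch, dyaw, gripper).
low = np.array([0.0, -0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([1.0, 0.1, 0.1, 0.1, np.pi, np.pi, np.pi, 1.0])
action = np.array([0.0, 0.02, -0.05, 0.01, 0.1, -0.2, 0.0, 1.0])

tokens = action_to_text(action, low, high)
print(tokens)                              # space-separated string of 8 integer bins
print(text_to_action(tokens, low, high))   # approximate reconstruction of the action
```

Because the encoded actions are ordinary text, robot trajectories and web-scale vision-language data can be mixed during co-fine-tuning and supervised through the same output head, which is the core of the VLA recipe described in the abstract.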