
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

July 28, 2023
Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich
cs.AI

Abstract

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
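The core of the recipe is that robot actions share the language model's token space. As a rough illustration of that idea (a sketch under assumed conventions, not RT-2's actual action tokenizer), the snippet below discretizes a hypothetical 7-DoF action into 256 uniform bins per dimension and serializes the bin indices as a plain text string that a vision-language model could emit and a low-level controller could decode back into continuous commands; the bounds and action layout shown are hypothetical.

```python
# Minimal, hypothetical sketch of the "actions as text tokens" idea described above.
# Assumptions not stated in the abstract: each action dimension is uniformly
# discretized into 256 bins, and the bin indices are written as a space-separated
# string so action strings can be trained on and decoded like ordinary text.
import numpy as np

NUM_BINS = 256  # assumed discretization resolution


def action_to_text(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to a string of per-dimension bin indices."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)


def text_to_action(text, low, high, num_bins=NUM_BINS):
    """Invert the mapping: parse bin indices back into an approximate action vector."""
    bins = np.array([int(tok) for tok in text.split()], dtype=np.float64)
    return low + bins / (num_bins - 1) * (high - low)


# Hypothetical 7-DoF action: xyz translation (m), xyz rotation (rad), gripper closure.
low = np.array([-0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([0.1, 0.1, 0.1, np.pi, np.pi, np.pi, 1.0])
action = np.array([0.02, -0.05, 0.0, 0.1, 0.0, -0.3, 1.0])

encoded = action_to_text(action, low, high)
print(encoded)                             # a string of 7 integers in [0, 255]
print(text_to_action(encoded, low, high))  # approximately recovers the action
```

Because the encoded action is just text, it can be mixed into the same training stream as visual-question-answering answers, which is what makes the co-fine-tuning recipe described in the abstract possible.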
