RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Abstract

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model both to learn to map robot observations to actions and to benefit from large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action (VLA) models and instantiate one example, which we call RT-2. Our extensive evaluation (6,000 evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. These include significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
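The abstract states that actions are expressed as text tokens but does not give the encoding. As a rough illustration only, the sketch below assumes each continuous action dimension is clipped to a fixed range, uniformly discretized into 256 bins, and written out as a space-separated string of integers. The function names, the bin count, and the [-1, 1] range are all assumptions made for this example, not details taken from the abstract.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of bins per action dimension


def action_to_text(action: Sequence[float], low: float = -1.0,
                   high: float = 1.0, num_bins: int = NUM_BINS) -> str:
    """Encode a continuous action vector as a string of integer bin indices.

    Hypothetical helper: the range [low, high] and the bin count are
    illustrative assumptions, not values stated in the abstract.
    """
    tokens = []
    for value in action:
        # Clip to the supported range so every value maps to a valid bin.
        clipped = min(max(value, low), high)
        # Scale to [0, num_bins - 1] and round to the nearest bin index.
        index = int(round((clipped - low) / (high - low) * (num_bins - 1)))
        tokens.append(str(index))
    return " ".join(tokens)


def text_to_action(text: str, low: float = -1.0, high: float = 1.0,
                   num_bins: int = NUM_BINS) -> List[float]:
    """Decode a string of bin indices back to approximate continuous values."""
    return [low + int(tok) / (num_bins - 1) * (high - low)
            for tok in text.split()]
```

Under these assumptions, a string produced by `action_to_text` can be used as a training target in the same way as a natural language answer, and `text_to_action` recovers each dimension to within one bin width at inference time.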
