RT-H: Action Hierarchies Using Language

Abstract

Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are available at https://rt-hierarchy.github.io.
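The two-stage hierarchy described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in for illustration only: the function names, the motion vocabulary, the scalar "observation", and the rule-based predictors are assumptions, not the RT-H model or its API. The sketch only shows the control flow: predict a language motion from the task and observation, then predict an action from the task, the language motion, and the observation, with an optional human-specified language motion overriding the first stage.

```python
from typing import Dict, Optional, Sequence, Tuple

Observation = Sequence[float]        # stand-in for visual features (assumption)
Action = Tuple[float, float, float]  # hypothetical end-effector delta (x, y, z)

# Hypothetical vocabulary of language motions, each mapped to a toy action delta.
MOTION_TO_DELTA: Dict[str, Action] = {
    "move arm forward": (0.05, 0.0, 0.0),
    "move arm left": (0.0, 0.05, 0.0),
    "move arm down": (0.0, 0.0, -0.05),
}


def predict_language_motion(task: str, obs: Observation) -> str:
    """Stage 1 (toy rule, not a learned model): task + observation -> language motion."""
    # Pretend obs[0] is the remaining distance to the target object.
    return "move arm forward" if obs[0] > 0.1 else "move arm down"


def predict_action(task: str, motion: str, obs: Observation) -> Action:
    """Stage 2 (toy rule): task + language motion + observation -> action."""
    dx, dy, dz = MOTION_TO_DELTA[motion]
    # The task still influences the action: move more gently when pouring.
    scale = 0.5 if "pour" in task else 1.0
    return (dx * scale, dy * scale, dz * scale)


def hierarchical_step(
    task: str, obs: Observation, human_motion: Optional[str] = None
) -> Tuple[str, Action]:
    """One control step; a human-specified language motion overrides stage 1."""
    if human_motion is not None:
        motion = human_motion
    else:
        motion = predict_language_motion(task, obs)
    return motion, predict_action(task, motion, obs)


if __name__ == "__main__":
    # Autonomous step: the policy picks its own language motion.
    print(hierarchical_step("pick coke can", [0.3]))
    # Corrected step: a person intervenes in language instead of teleoperating.
    print(hierarchical_step("pick coke can", [0.3], human_motion="move arm left"))
```

In this sketch the correction only replaces the intermediate language motion; the second stage still produces the low-level action, which is the property that makes language interventions cheaper to give than teleoperated ones.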
