

RT-H: Action Hierarchies Using Language

March 4, 2024
Authors: Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh
cs.AI

Abstract

Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are found at https://rt-hierarchy.github.io.
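The abstract describes a two-stage inference loop: given the visual observation and the high-level task, the policy first predicts a fine-grained language motion (e.g., "move arm forward"), then predicts the low-level action conditioned on the task, the observation, and that language motion; a human can override the predicted language motion at runtime. The toy sketch below illustrates only this control flow. All names (`rt_h_step`, the toy policies, the motion/action mappings) are hypothetical stand-ins, not the paper's actual models or code.

```python
# Hypothetical sketch of RT-H's language-motion hierarchy at inference time.
# The two "policies" here are toy lookup tables standing in for the learned
# vision-language models described in the paper.

def toy_motion_policy(image, task):
    # Stage 1: visual context + high-level task -> language motion.
    return {"pick coke can": "move arm forward"}.get(task, "stay")

def toy_action_policy(image, task, motion):
    # Stage 2: visual context + task + language motion -> low-level action
    # (here a fake 3-DoF end-effector delta).
    return {"move arm forward": [0.1, 0.0, 0.0],
            "move arm up": [0.0, 0.0, 0.1]}.get(motion, [0.0, 0.0, 0.0])

def rt_h_step(image, task, motion_policy, action_policy, correction=None):
    """One control step: predict a language motion (or accept a human
    correction in language), then predict the action conditioned on it."""
    motion = correction if correction is not None else motion_policy(image, task)
    action = action_policy(image, task, motion)
    return motion, action

# Normal execution: the policy picks its own language motion.
motion, action = rt_h_step(None, "pick coke can",
                           toy_motion_policy, toy_action_policy)

# Human intervention: a language correction replaces the predicted motion,
# and the action model follows the corrected instruction instead.
motion_c, action_c = rt_h_step(None, "pick coke can",
                               toy_motion_policy, toy_action_policy,
                               correction="move arm up")
```

Because the correction is itself a language motion, interventions can be logged as extra (observation, task, motion, action) tuples and folded back into training, which is the learning-from-language-intervention paradigm the abstract highlights.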