RT-H: Action Hierarchies Using Language

March 4, 2024
Authors: Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh
cs.AI

Abstract

Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only respond to language interventions, but can also learn from such interventions, and they outperform methods that learn from teleoperated interventions. Our website and videos can be found at https://rt-hierarchy.github.io.
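
To make the two-stage hierarchy concrete, below is a minimal Python sketch of the inference loop the abstract describes: stage 1 predicts a fine-grained language motion from the image and the high-level task, and stage 2 predicts a low-level action conditioned on the image, the task, and that language motion. All names here (Observation, LanguageMotionPolicy, ActionPolicy, rt_h_step) and the placeholder outputs are hypothetical illustrations, not the authors' actual implementation or API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Observation:
    image: Any  # current camera frame
    task: str   # high-level task in language, e.g. "pick coke can"


class LanguageMotionPolicy:
    """Stage 1: maps (image, high-level task) to a fine-grained
    language motion such as "move arm forward"."""

    def predict(self, obs: Observation) -> str:
        # Placeholder: a learned model would be queried here.
        return "move arm forward"


class ActionPolicy:
    """Stage 2: maps (image, high-level task, language motion) to a
    low-level robot action."""

    def predict(self, obs: Observation, language_motion: str) -> list[float]:
        # Placeholder: returns a dummy end-effector delta.
        return [0.05, 0.0, 0.0]


def rt_h_step(
    obs: Observation,
    motion_policy: LanguageMotionPolicy,
    action_policy: ActionPolicy,
    human_correction: Optional[str] = None,
) -> list[float]:
    # Because actions are conditioned on a language motion, a human can
    # correct the policy mid-execution simply by supplying a different
    # phrase; such corrections can also serve as new training data.
    language_motion = human_correction or motion_policy.predict(obs)
    return action_policy.predict(obs, language_motion)


if __name__ == "__main__":
    obs = Observation(image=None, task="pick coke can")
    print(rt_h_step(obs, LanguageMotionPolicy(), ActionPolicy()))
    # A language intervention overrides stage 1 but still uses stage 2:
    print(rt_h_step(obs, LanguageMotionPolicy(), ActionPolicy(),
                    human_correction="rotate arm left"))
```

Note how human_correction overrides only the first stage: this is what lets a language intervention both steer execution immediately and provide a supervised label for the language-motion predictor, matching the abstract's claim that RT-H can learn from such interventions.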