RT-H: Action Hierarchies Using Language

Abstract

Language provides a way to break down complex concepts into digestible pieces. Recent work in robot imitation learning uses language-conditioned policies that predict actions given visual observations and a high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method, RT-H, builds an action hierarchy using language motions: it first learns to predict language motions, and then, conditioned on the predicted language motion and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only respond to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are available at https://rt-hierarchy.github.io.
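The two-stage structure described in the abstract (task to language motion, then language motion to action, with an optional human-specified language motion replacing the predicted one) can be illustrated with a minimal Python sketch. This is not the RT-H implementation: the class HierarchicalPolicy, the functions toy_motion and toy_action, the MOTION_TO_DELTA table, and the list-of-floats observation are all hypothetical stand-ins chosen for illustration, and the real method uses learned models over visual input at both stages.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Observation = Sequence[float]  # hypothetical stand-in for visual input
Action = Tuple[float, ...]     # hypothetical stand-in for a low-level command


@dataclass
class HierarchicalPolicy:
    """Two-stage policy sketch: task -> language motion -> action."""

    # Stage 1: (observation, task) -> language motion, e.g. "move arm forward"
    predict_motion: Callable[[Observation, str], str]
    # Stage 2: (observation, task, language motion) -> low-level action
    predict_action: Callable[[Observation, str, str], Action]

    def step(
        self, obs: Observation, task: str, correction: Optional[str] = None
    ) -> Tuple[str, Action]:
        # A human-specified language motion, when given, replaces the
        # predicted one; stage 2 then conditions on it as usual.
        if correction is not None:
            motion = correction
        else:
            motion = self.predict_motion(obs, task)
        action = self.predict_action(obs, task, motion)
        return motion, action


# Toy stand-ins for the two learned stages (assumptions, not real models).
MOTION_TO_DELTA = {
    "move arm forward": (0.1, 0.0, 0.0),
    "move arm left": (0.0, 0.1, 0.0),
    "close gripper": (0.0, 0.0, 0.0),
}


def toy_motion(obs: Observation, task: str) -> str:
    # Pretend the first feature encodes distance to the target object.
    return "move arm forward" if obs[0] < 0.5 else "close gripper"


def toy_action(obs: Observation, task: str, motion: str) -> Action:
    # Look up a fixed end-effector delta for each language motion.
    return MOTION_TO_DELTA[motion]


policy = HierarchicalPolicy(toy_motion, toy_action)
predicted = policy.step([0.2], "pick coke can")
corrected = policy.step([0.2], "pick coke can", correction="move arm left")
```

In this sketch the correction is applied at the language-motion level rather than at the action level, which mirrors the abstract's point that a policy conditioned on language motions can be corrected with a short phrase during execution.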
