Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

Abstract

Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." However, collecting large amounts of labeled data, i.e., demonstrations of tasks annotated with the corresponding language instruction, is prohibitively difficult. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this problem using vision-language models or by jointly training language-goal-conditioned policies, but so far neither approach has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, while the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: http://tiny.cc/grif
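The hindsight labeling mentioned above, where the final state of any trajectory is treated as its goal, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `GoalConditionedSample` and `relabel_with_final_state` are hypothetical, and the sketch assumes a trajectory is stored as a list of observations containing one more entry than the list of actions.

```python
from typing import Any, List, NamedTuple, Sequence


class GoalConditionedSample(NamedTuple):
    """One training tuple for a goal-conditioned policy (hypothetical container)."""
    observation: Any
    action: Any
    goal: Any


def relabel_with_final_state(
    observations: Sequence[Any], actions: Sequence[Any]
) -> List[GoalConditionedSample]:
    """Label every step of an unlabeled trajectory with its final observation as the goal.

    Assumption: observations[t] is the state before actions[t], and the last
    observation is the state reached after the final action.
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected exactly one more observation than actions")
    goal = observations[-1]
    # Pair each pre-action observation with its action and the shared hindsight goal.
    return [
        GoalConditionedSample(obs, act, goal)
        for obs, act in zip(observations[:-1], actions)
    ]


# Usage: strings stand in for images and robot actions.
samples = relabel_with_final_state(["img_0", "img_1", "img_2"], ["reach", "place"])
```

No language annotation is needed to build these tuples, which is why goal-conditioned training data is far cheaper to obtain than instruction-labeled data.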
