Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

Abstract

Our goal is for robots to follow natural language instructions such as "put the towel next to the microwave." However, obtaining large amounts of labeled data, i.e., demonstrations of tasks annotated with the corresponding language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language, using only a small amount of language data. Prior work has made progress on this problem either by using vision-language models or by jointly training language-goal-conditioned policies, but so far neither approach has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, while the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: http://tiny.cc/grif.
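The two ingredients described above, hindsight goal labeling and aligning language to the change between start and goal images, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear `encode_task` stand-in for a learned encoder, and the choice of a symmetric contrastive (InfoNCE-style) loss with a `temperature` parameter are all assumptions made for illustration only.

```python
import numpy as np


def hindsight_relabel(observations, actions):
    """Label an unlabeled trajectory with its own final observation as the goal.

    observations: array of shape (T + 1, obs_dim); actions: array of shape (T, act_dim).
    Returns a list of (observation, goal, action) training tuples.
    """
    goal = observations[-1]
    return [(observations[t], goal, actions[t]) for t in range(len(actions))]


def encode_task(start, goal, weights):
    """Hypothetical linear encoder of the (start, goal) pair; a stand-in for a learned network."""
    return np.concatenate([start, goal], axis=-1) @ weights


def _l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def _log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def alignment_loss(task_emb, lang_emb, temperature=0.1):
    """Symmetric contrastive loss (an assumed choice): row i of task_emb matches row i of lang_emb."""
    logits = _l2_normalize(task_emb) @ _l2_normalize(lang_emb).T / temperature
    idx = np.arange(logits.shape[0])
    task_to_lang = -_log_softmax(logits, axis=1)[idx, idx].mean()
    lang_to_task = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (task_to_lang + lang_to_task)


# Toy usage with random features standing in for image and language embeddings.
rng = np.random.default_rng(0)
start, goal = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
weights = rng.normal(size=(32, 8))
task_emb = encode_task(start, goal, weights)
lang_emb = rng.normal(size=(4, 8))
loss = alignment_loss(task_emb, lang_emb)
```

In this sketch, only the small labeled subset would pass through `alignment_loss`, while every trajectory, labeled or not, can be turned into policy training tuples by `hindsight_relabel`.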
