

Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

June 30, 2023
作者: Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine
cs.AI

Abstract

Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." But getting large amounts of labeled data, i.e., data that contains demonstrations of tasks labeled with the language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: http://tiny.cc/grif.
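The key alignment idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration of aligning an instruction embedding with the start-to-goal change rather than with the goal image alone; the encoders, the subtraction used to represent the change, the `alignment_loss` name, and the symmetric InfoNCE objective are assumptions for illustration, not the paper's released implementation.

```python
# Minimal sketch (not the authors' released code): contrastive alignment of
# language with the start->goal change, as described in the abstract.
import torch
import torch.nn.functional as F

def alignment_loss(lang_emb: torch.Tensor,
                   start_emb: torch.Tensor,
                   goal_emb: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """Align instructions with the desired change between start and goal
    images, rather than with the goal image alone."""
    # Represent each task by the transition from start to goal state
    # (a simple difference here; the paper's exact form may differ).
    task_emb = F.normalize(goal_emb - start_emb, dim=-1)   # (B, D)
    lang_emb = F.normalize(lang_emb, dim=-1)               # (B, D)

    # Symmetric InfoNCE over the batch: each instruction should score
    # highest against its own start->goal change and low against all others.
    logits = lang_emb @ task_emb.T / temperature           # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```

A goal-conditioned policy can then be trained on the shared task embedding using all hindsight-relabeled data, with language entering only through this aligned interface at evaluation time.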