Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

June 30, 2023
作者: Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine
cs.AI

Abstract

Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." But getting large amounts of labeled data, i.e. data that contains demonstrations of tasks labeled with the language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: http://tiny.cc/grif.
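To make the alignment idea concrete, here is a minimal PyTorch sketch of a CLIP-style contrastive objective that matches an instruction embedding to an embedding of the desired change between start and goal images. This is illustrative only, not the authors' released code: the encoder modules, the learnable temperature, and in particular the use of a simple feature difference to represent the start-to-goal change are assumptions made for this sketch.

```python
# Minimal sketch (illustrative only): contrastively align a language
# instruction with the desired *change* between start and goal images.
# `image_encoder`, `text_encoder`, and the feature-difference transition
# representation are hypothetical stand-ins, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionLanguageAligner(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # image batch -> (B, dim) features
        self.text_encoder = text_encoder    # token batch -> (B, dim) features
        self.log_temp = nn.Parameter(torch.tensor(0.0))  # learnable temperature

    def embed_transition(self, start_img, goal_img):
        # Represent the task by the change from start to goal,
        # rather than by the goal image alone.
        delta = self.image_encoder(goal_img) - self.image_encoder(start_img)
        return F.normalize(delta, dim=-1)

    def embed_language(self, tokens):
        return F.normalize(self.text_encoder(tokens), dim=-1)

    def loss(self, start_imgs, goal_imgs, tokens):
        # Symmetric InfoNCE over a batch of labeled (transition, instruction)
        # pairs: matching pairs lie on the diagonal of the similarity matrix.
        z_task = self.embed_transition(start_imgs, goal_imgs)  # (B, dim)
        z_lang = self.embed_language(tokens)                   # (B, dim)
        logits = (z_task @ z_lang.T) * self.log_temp.exp()     # (B, B)
        target = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, target) +
                      F.cross_entropy(logits.T, target))
```

In a setup like this, the task embedding for any trajectory can be computed from its start image and hindsight-relabeled goal image alone, so all unlabeled data can be used to train the downstream policy; at test time, embed_language maps an instruction into the same space, giving language a handle on that policy.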