VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Abstract

Large language models (LLMs) have been shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite this progress, most approaches still rely on pre-defined motion primitives to carry out physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open set of instructions and an open set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps that ground this knowledge in the observation space of the agent. The composed value maps are then used in a model-based planning framework to synthesize, zero-shot, closed-loop robot trajectories that are robust to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experience by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Project website: https://voxposer.github.io
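To make the idea of composing value maps and planning over them more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a small voxel grid, a hand-picked target and obstacle (in the paper these would come from the LLM and VLM), and a simple greedy descent in place of the closed-loop model-based planner. All names (`GRID`, `make_cost_map`, `greedy_plan`) and all parameter values are hypothetical. The composed map is written here as a cost to be minimized rather than a value to be maximized.

```python
import itertools
import math

GRID = 10  # voxels per axis (hypothetical resolution)


def make_cost_map(target, obstacle, avoid_weight=5.0, avoid_radius=2.0):
    """Compose a cost map: distance to the target plus a penalty near the obstacle."""
    cost = {}
    for voxel in itertools.product(range(GRID), repeat=3):
        # "Affordance" term: cost shrinks as the voxel approaches the target.
        affordance = math.dist(voxel, target)
        # "Avoidance" term: linear penalty that fades to zero at avoid_radius.
        d_obs = math.dist(voxel, obstacle)
        avoidance = avoid_weight * max(0.0, 1.0 - d_obs / avoid_radius)
        cost[voxel] = affordance + avoidance
    return cost


def greedy_plan(cost, start, max_steps=100):
    """Step to the cheapest neighbouring voxel until no neighbour improves."""
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbours = [
            tuple(c + d for c, d in zip(current, delta))
            for delta in itertools.product((-1, 0, 1), repeat=3)
            if any(delta)  # skip the zero offset (the current voxel itself)
        ]
        neighbours = [n for n in neighbours if n in cost]  # stay inside the grid
        best = min(neighbours, key=cost.get)
        if cost[best] >= cost[current]:
            break  # local minimum reached
        current = best
        path.append(current)
    return path


# Assumed inputs: in the paper these would be derived from language and perception.
target = (8, 8, 8)
obstacle = (4, 4, 4)
cost_map = make_cost_map(target, obstacle)
path = greedy_plan(cost_map, start=(0, 0, 0))
```

Greedy descent is used only because it is short; it can stall in local minima on less benign maps and does not re-plan as the scene changes, which is the role a closed-loop planner would play.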
