Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
July 27, 2023
Authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola
cs.AI
Abstract
Self-supervised and language-supervised image models contain rich knowledge
of the world that is important for generalization. Many robotic tasks, however,
require a detailed understanding of 3D geometry, which is often lacking in 2D
image features. This work bridges this 2D-to-3D gap for robotic manipulation by
leveraging distilled feature fields to combine accurate 3D geometry with rich
semantics from 2D foundation models. We present a few-shot learning method for
6-DOF grasping and placing that harnesses these strong spatial and semantic
priors to achieve in-the-wild generalization to unseen objects. Using features
distilled from a vision-language model, CLIP, we present a way to designate
novel objects for manipulation via free-text natural language, and demonstrate
its ability to generalize to unseen expressions and novel categories of
objects.
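The language-guided designation described above can be illustrated with a toy sketch: a distilled feature field maps 3D points to CLIP-like feature vectors, and a free-text query is resolved by finding the point whose feature is most similar to the text embedding. Everything here is a stand-in — the random vectors, the voxel dictionary, and `query_grasp_point` are hypothetical and not the paper's actual implementation, which distills real CLIP features into a neural field.

```python
import numpy as np

# Toy stand-in for a distilled feature field: a few voxel centers, each
# carrying a CLIP-like feature vector. In the paper these features are
# distilled from a 2D vision-language model into a 3D field; here we use
# small random unit vectors purely for illustration.

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical 8-dim embeddings standing in for CLIP features.
feat_mug = normalize(rng.normal(size=8))
feat_bowl = normalize(rng.normal(size=8))

# "Field" sampled at two voxel centers (x, y, z) -> feature.
voxels = {
    (0.10, 0.20, 0.05): feat_mug,   # region occupied by a mug
    (0.40, 0.10, 0.05): feat_bowl,  # region occupied by a bowl
}

def query_grasp_point(text_embedding, field):
    """Return the 3D point whose distilled feature best matches the query,
    scored by cosine similarity (all features are unit-normalized)."""
    points = list(field.keys())
    feats = np.stack([field[p] for p in points])
    sims = feats @ normalize(text_embedding)
    best = int(np.argmax(sims))
    return points[best], float(sims[best])

# Pretend a text encoder maps "pick up the mug" near the mug's feature.
text_mug = normalize(feat_mug + 0.1 * rng.normal(size=8))
point, score = query_grasp_point(text_mug, voxels)
print(point)  # -> the mug voxel (0.10, 0.20, 0.05)
```

Because image and text embeddings live in a shared space, the same query mechanism generalizes to unseen expressions: any phrase the text encoder can embed can select a region of the field, which is the property the abstract highlights.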