
Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

July 27, 2023
Authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola
cs.AI

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
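To make the core idea concrete, here is a minimal sketch of how a distilled feature field could be queried with free-text language: if each 3D point carries a feature vector distilled into CLIP's embedding space, a text query can be embedded and matched against the points by cosine similarity to localize the object to manipulate. All names and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def localize_by_text(point_features: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the 3D point whose distilled feature best
    matches a CLIP text embedding, by cosine similarity.

    point_features: (N, D) per-point features distilled into CLIP space
    text_embedding: (D,) CLIP embedding of a free-text query
    """
    feats = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    scores = feats @ text          # cosine similarity per point
    return int(np.argmax(scores))  # best-matching point

# Toy check: random stand-in features; the "text" embedding is taken to lie
# near point 42, so that point should win the similarity search.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))
query = features[42].copy()
print(localize_by_text(features, query))  # → 42
```

In the actual method, the selected high-similarity region would then inform a 6-DOF grasp or place pose, drawing on the field's accurate 3D geometry rather than a single argmax point.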