

Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

July 27, 2023
Authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola
cs.AI

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
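To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how a distilled feature field could be queried with a CLIP text embedding to localize a language-specified object: sample candidate 3D points, evaluate the field's CLIP-aligned features at those points, and score them against the text embedding by cosine similarity. The `feature_field` and `embed_text` functions are stand-ins; a real system would use a trained NeRF-style field and CLIP's actual text encoder.

```python
# Illustrative sketch only: querying a distilled feature field with a
# CLIP-style text embedding to find the most query-relevant 3D region.
import numpy as np

D = 512  # assumed CLIP embedding dimension

def feature_field(points: np.ndarray) -> np.ndarray:
    """Stand-in for a distilled feature field: maps (N, 3) points to (N, D)
    unit-norm, CLIP-aligned features. A real field would be a trained MLP."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, D))
    feats = np.tanh(points @ proj)
    return feats / np.linalg.norm(feats, axis=-1, keepdims=True)

def embed_text(query: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder; returns a unit-norm (D,) embedding."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(D)
    return v / np.linalg.norm(v)

# Sample candidate 3D points in the workspace and score them against the query.
points = np.random.uniform(-0.5, 0.5, size=(4096, 3))
text_emb = embed_text("a blue mug")
scores = feature_field(points) @ text_emb   # cosine similarity per point
best = points[np.argmax(scores)]            # most query-relevant 3D location
print("candidate region for the queried object:", best)
```

In the actual method, such language-conditioned relevance would be combined with few-shot demonstrations and the field's 3D geometry to propose full 6-DOF grasp and place poses, rather than a single point.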