Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
