Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.

