Free-form language-based robotic reasoning and grasping

Abstract

Grasping a requested object from a cluttered bin based on human instructions is challenging: the robot must understand both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions through the free-form language-based robotic grasping task and propose a novel method, FreeGrasp, which leverages the world knowledge of pre-trained VLMs to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate the image with marks, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or whether other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset, FreeGraspData, by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses on FreeGraspData and validate the method in the real world with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.
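
The decision described in the abstract (grasp the requested object directly, or first remove whatever obstructs it) can be illustrated with a small symbolic sketch. This is not the FreeGrasp implementation: the paper delegates this reasoning to GPT-4o over mark-annotated images, whereas the code below assumes a hypothetical, already-known occlusion map `blocked_by` keyed by object mark ID. The function names `next_grasp` and `grasp_sequence` are likewise illustrative only.

```python
from typing import Dict, List, Set


def next_grasp(target: int, blocked_by: Dict[int, Set[int]]) -> int:
    """Return the ID of the object to grasp next on the way to `target`.

    `blocked_by[i]` is a hypothetical set of object IDs lying on top of
    object i (an assumption for this sketch, not an output of the paper).
    """
    current = target
    visited: Set[int] = set()
    # Walk up the occlusion chain until an unobstructed object is found.
    while blocked_by.get(current):
        if current in visited:
            raise ValueError("cyclic occlusion: no object is graspable")
        visited.add(current)
        # Pick the smallest ID so the choice is deterministic.
        current = min(blocked_by[current])
    return current


def grasp_sequence(target: int, blocked_by: Dict[int, Set[int]]) -> List[int]:
    """Return the full grasp order, ending with `target`."""
    # Work on a copy so the caller's occlusion map is left untouched.
    remaining = {obj: set(tops) for obj, tops in blocked_by.items()}
    sequence: List[int] = []
    while True:
        obj = next_grasp(target, remaining)
        sequence.append(obj)
        if obj == target:
            return sequence
        # Simulate removing the grasped object from the bin.
        remaining.pop(obj, None)
        for tops in remaining.values():
            tops.discard(obj)


if __name__ == "__main__":
    # Object 0 is covered by 1 and 2; object 1 is covered by 3.
    scene = {0: {1, 2}, 1: {3}, 2: set(), 3: set()}
    print(grasp_sequence(0, scene))  # [3, 1, 2, 0]
```

In this toy scene the requested object 0 is not directly graspable, so the sketch removes objects 3, 1, and 2 before grasping it. An unobstructed target, such as object 2, is returned immediately as the only grasp.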
