Free-form language-based robotic reasoning and grasping
March 17, 2025
Authors: Runyu Jiao, Alice Fasoli, Francesco Giuliari, Matteo Bortolon, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
cs.AI
Abstract
Performing robotic grasping from a cluttered bin based on human instructions
is a challenging task, as it requires understanding both the nuances of
free-form language and the spatial relationships between objects.
Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have
demonstrated remarkable reasoning capabilities across both text and images. But
can they truly be used for this task in a zero-shot setting? And what are their
limitations? In this paper, we explore these research questions via the
free-form language-based robotic grasping task, and propose a novel method,
FreeGrasp, leveraging the pre-trained VLMs' world knowledge to reason about
human instructions and object spatial arrangements. Our method detects all
objects as keypoints and uses these keypoints to annotate marks on images,
aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our
method to determine whether a requested object is directly graspable or if
other objects must be grasped and removed first. Since no existing dataset is
specifically designed for this task, we introduce a synthetic dataset
FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated
instructions and ground-truth grasping sequences. We conduct extensive analyses
on FreeGraspData and real-world validation with a gripper-equipped
robotic arm, demonstrating state-of-the-art performance in grasp reasoning and
execution. Project website: https://tev-fbk.github.io/FreeGrasp/.
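
The decision described in the abstract (grasp the requested object directly, or first remove the objects on top of it) can be illustrated with a small sketch. This is not the authors' implementation: in FreeGrasp the reasoning is done by GPT-4o over images annotated with marks. The function `grasp_sequence` and the `blockers` dictionary below are hypothetical names, and the sketch simply assumes that an obstruction relation between mark IDs is already available (for example, parsed from a VLM answer) and contains no cycles.

```python
from typing import Dict, List


def grasp_sequence(target: int, blockers: Dict[int, List[int]]) -> List[int]:
    """Return mark IDs in the order they would be grasped, ending with target.

    blockers[m] lists the marks lying on top of mark m. This input is a
    hypothetical stand-in for the output of the VLM reasoning step.
    Assumes the obstruction relation has no cycles.
    """
    order: List[int] = []
    visited = set()

    def visit(mark: int) -> None:
        # Depth-first traversal: clear everything above a mark before the mark.
        if mark in visited:
            return
        visited.add(mark)
        for blocker in blockers.get(mark, []):
            visit(blocker)
        order.append(mark)

    visit(target)
    return order


# Toy scene (assumed): mark 3 lies on mark 1, mark 2 lies on mark 3, mark 0 is free.
scene = {1: [3], 3: [2]}
print(grasp_sequence(1, scene))  # [2, 3, 1]
print(grasp_sequence(0, scene))  # [0]
```

A sequence of length one means the requested object is directly graspable; a longer sequence lists the obstructing objects to remove first, with the requested object last.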