

Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

May 21, 2025
Authors: Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Ken Goldberg
cs.AI

Abstract

Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm - using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, the task goal, and the target object. These properties are used to generate representative VQA queries - images with textual multiple-choice questions - based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
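The abstract describes a pipeline: segment a teleoperated trajectory into manipulation phases from non-visual signals (gripper aperture, force sensing), then fill question templates with the resulting ground truth. The sketch below is an illustrative reconstruction of that idea, not the authors' code; the threshold values, the `Step` fields, and the phase names are assumptions chosen only to make the example concrete.

```python
"""Minimal sketch of the Robo2VLM idea (assumed field names and thresholds):
label timesteps with a coarse manipulation phase from gripper aperture and
contact force, then emit a templated multiple-choice VQA item."""

from dataclasses import dataclass
from typing import List
import random

@dataclass
class Step:
    gripper_aperture: float   # 0.0 = closed, 1.0 = open (assumed convention)
    contact_force: float      # wrist force magnitude in newtons (assumed)

def segment_phases(traj: List[Step], close_thresh: float = 0.2,
                   force_thresh: float = 2.0) -> List[str]:
    """Assign each timestep a coarse phase label from non-visual signals."""
    phases = []
    for s in traj:
        if s.gripper_aperture > close_thresh and s.contact_force < force_thresh:
            phases.append("approach")         # gripper open, no contact yet
        elif s.gripper_aperture <= close_thresh and s.contact_force >= force_thresh:
            phases.append("grasp/transport")  # gripper closed on the object
        else:
            phases.append("release")          # opening again or contact lost
    return phases

def make_vqa_item(phase: str, target_object: str) -> dict:
    """Fill an interaction-reasoning template; the phase label is the ground truth."""
    choices = ["approach", "grasp/transport", "release", "idle"]
    random.shuffle(choices)
    return {
        "question": f"The robot is manipulating the {target_object}. "
                    "Which manipulation phase does this frame show?",
        "choices": choices,
        "answer_index": choices.index(phase),
    }

# Toy three-step trajectory: approach, grasp, release.
traj = [Step(0.9, 0.1), Step(0.15, 4.0), Step(0.8, 0.3)]
phases = segment_phases(traj)
print(make_vqa_item(phases[1], "mug"))
```

In the actual dataset each question is paired with the corresponding camera frame; the sketch omits the image and the spatial / goal-conditioned templates for brevity.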