

Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

May 21, 2025
作者: Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Ken Goldberg
cs.AI

Abstract

Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm - using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, the task goal, and the target object. These properties are used to generate representative VQA queries - images with textual multiple-choice questions - based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
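
The abstract outlines a two-step pipeline: segment a tele-operated trajectory into manipulation phases using non-visual signals (e.g., gripper aperture and force sensing), then fill multiple-choice question templates with the phase-derived ground truth. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; the names (`Frame`, `Phase`, `segment_phases`, `make_gripper_question`) and the thresholds are assumptions made for the example.

```python
# Minimal sketch: segment a trajectory into manipulation phases from
# gripper aperture and sensed force, then instantiate one multiple-choice
# VQA template from the resulting ground-truth phase label.
# All names and thresholds here are hypothetical, not from the paper.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    t: float                 # timestamp (s)
    gripper_aperture: float  # 0.0 = closed, 1.0 = fully open
    contact_force: float     # sensed force magnitude (N)


@dataclass
class Phase:
    label: str
    start: int  # index of first frame in the phase
    end: int    # index of last frame in the phase (inclusive)


def segment_phases(frames: List[Frame],
                   close_thresh: float = 0.3,
                   force_thresh: float = 1.0) -> List[Phase]:
    """Label each frame from non-visual signals, then merge runs of equal labels."""
    labels = []
    for f in frames:
        if f.gripper_aperture > close_thresh and f.contact_force < force_thresh:
            labels.append("approach")    # open gripper, no contact yet
        elif f.gripper_aperture <= close_thresh and f.contact_force >= force_thresh:
            labels.append("grasp")       # closed gripper, in contact
        else:
            labels.append("transport")   # everything in between
    phases, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            phases.append(Phase(labels[start], start, i - 1))
            start = i
    return phases


def make_gripper_question(phase: Phase) -> dict:
    """Fill a multiple-choice template with the phase-derived ground truth."""
    choices = ["approach", "grasp", "transport"]
    return {
        "question": "Based on the image, which manipulation phase is the robot in?",
        "choices": choices,
        "answer": choices.index(phase.label),
    }


if __name__ == "__main__":
    demo = [Frame(0.0, 0.9, 0.0), Frame(0.1, 0.8, 0.1),
            Frame(0.2, 0.2, 2.5), Frame(0.3, 0.2, 2.4),
            Frame(0.4, 0.2, 0.5)]
    for p in segment_phases(demo):
        q = make_gripper_question(p)
        print(p.label, p.start, p.end, "->", q["choices"][q["answer"]])
```

In the real framework, each question would be paired with image frames sampled from the corresponding phase; this sketch only shows how non-visual sensing can supply the answer key without manual annotation.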

