RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
November 1, 2023
Authors: Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao
cs.AI
Abstract
We present a scalable, bottom-up, and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput than traditional narrow top-down step-by-step collection. We collect realistic data by performing any user request within the entirety of 3 office buildings, using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that, for a fixed collection budget, it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even when imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa, trained on our dataset, that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a large amount of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks. Data and videos are available at https://robovqa.github.io.
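
To make the single-metric evaluation described in the abstract concrete, the following minimal Python sketch shows one plausible way to compute a cognitive intervention rate over evaluated episodes and the relative reduction between a model and a baseline. The Episode record, function names, and the toy numbers are illustrative assumptions, not the paper's actual implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    """Hypothetical record of one evaluated long-horizon episode."""
    num_model_steps: int     # high-level reasoning steps proposed by the model
    num_interventions: int   # steps where a human operator had to intervene

def intervention_rate(episodes: List[Episode]) -> float:
    """Fraction of proposed steps that required a human intervention."""
    total_steps = sum(e.num_model_steps for e in episodes)
    total_interventions = sum(e.num_interventions for e in episodes)
    return total_interventions / total_steps if total_steps else 0.0

def relative_reduction(model_rate: float, baseline_rate: float) -> float:
    """Relative reduction of a model's rate versus a baseline (0.46 -> '46% lower')."""
    return 1.0 - model_rate / baseline_rate

# Toy numbers only: a model needing 27 interventions per 100 steps versus a
# baseline needing 50 per 100 would be reported as a 46% lower intervention rate.
model = [Episode(num_model_steps=100, num_interventions=27)]
baseline = [Episode(num_model_steps=100, num_interventions=50)]
print(f"{relative_reduction(intervention_rate(model), intervention_rate(baseline)):.0%} lower")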