

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

November 1, 2023
Authors: Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao
cs.AI

Abstract

We present a scalable, bottom-up, and intrinsically diverse data collection scheme that can be used for high-level reasoning over long and medium horizons, and that has 2.2x higher throughput than traditional narrow top-down step-by-step collection. We collect realistic data by performing any user request within the entirety of three office buildings, using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data alone, even when evaluated solely on robot episodes. We find that, for a fixed collection budget, it is beneficial to take advantage of cheaper human collection alongside robot collection. We release a large and highly diverse dataset (29,520 unique instructions) dubbed RoboVQA, containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even when imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa, trained on our dataset, that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings, with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a large amount of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks. Data and videos are available at https://robovqa.github.io.
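To make the evaluation idea concrete, below is a minimal sketch of how a cognitive intervention rate could be computed as a single performance metric: the fraction of high-level planning steps at which a human overseer had to correct the model. The Step structure, its field names, and the per-step bookkeeping are illustrative assumptions; the paper's exact definition and logging format are not specified in the abstract.

from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    instruction: str   # high-level instruction handled at this step (hypothetical field)
    intervened: bool   # True if a human had to override the model's output (hypothetical field)

def cognitive_intervention_rate(episodes: List[List[Step]]) -> float:
    """Fraction of planning steps across all episodes that required a human intervention."""
    steps = [step for episode in episodes for step in episode]
    if not steps:
        return 0.0
    return sum(step.intervened for step in steps) / len(steps)

# Illustrative usage: one intervention out of five steps gives a rate of 0.2.
episodes = [
    [Step("pick up the cup", False), Step("place it in the sink", True)],
    [Step("open the drawer", False), Step("take out a pen", False), Step("close the drawer", False)],
]
print(cognitive_intervention_rate(episodes))  # 0.2

Because tasks are run to completion under human oversight, a metric like this summarizes model quality in a single number while still permitting deployment of an imperfect model.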