RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Abstract

We present a scalable, bottom-up, and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput than traditional narrow, top-down, step-by-step collection. We collect realistic data by performing any user request within the entirety of 3 office buildings, using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on robot data only, even when evaluated solely on robot episodes. We find that, for a fixed collection budget, it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse dataset dubbed RoboVQA, containing 829,502 (video, text) pairs with 29,520 unique instructions, for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making a model deployable with human oversight even if it is imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa, trained on our dataset, that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks. Data and videos are available at https://robovqa.github.io
