OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench thus better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly degrade model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are publicly available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
