OST-Bench: Evaluating the Capabilities of Multimodal Large Language Models in Online Spatio-Temporal Scene Understanding

Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench therefore better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that complex clue-based spatial reasoning demands and long-term memory retrieval requirements each significantly degrade model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are publicly available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
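
The abstract reports that accuracy declines as the exploration horizon extends. One way to inspect such a trend is to bucket answered questions by how many observation steps preceded them and compute accuracy per bucket. The following is a minimal sketch of that idea only, not the OST-Bench evaluation code: the `Record` schema, the `accuracy_by_horizon` helper, and the `bucket_size` parameter are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One answered question in an online episode (hypothetical schema)."""
    turn: int      # number of observation steps that preceded the question
    correct: bool  # whether the model's answer matched the reference


def accuracy_by_horizon(records, bucket_size=5):
    """Group records into fixed-width turn buckets and return accuracy per bucket.

    Returns a dict mapping (first_turn, last_turn) of each non-empty bucket
    to the fraction of correctly answered questions in that bucket.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        bucket = r.turn // bucket_size
        totals[bucket] += 1
        hits[bucket] += int(r.correct)
    return {
        (b * bucket_size, (b + 1) * bucket_size - 1): hits[b] / totals[b]
        for b in sorted(totals)
    }
```

Calling `accuracy_by_horizon(records)` on a list of `Record` objects yields one accuracy value per turn range, so a decline over longer horizons would show up as decreasing values across successive buckets.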