OST-Bench: Evaluating the Capabilities of Multimodal Large Language Models in Online Spatio-Temporal Scene Understanding

Abstract

Recent multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. Most existing benchmarks, however, evaluate models in offline settings with a fixed set of pre-recorded inputs. We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench therefore better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that complex clue-based spatial reasoning demands and long-term memory retrieval requirements each significantly degrade model performance, along two separate axes. These findings highlight the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are publicly available. Project page: https://rbler1234.github.io/OSTBench.github.io/