OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models in offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench therefore better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. In the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that complex clue-based spatial reasoning demands and long-term memory retrieval requirements each significantly degrade model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are publicly available. Project page: https://rbler1234.github.io/OSTBench.github.io/
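
To make the online setting concrete, the following is a minimal Python sketch of how accuracy might be tracked as the exploration horizon extends. It is an illustration under stated assumptions, not the OST-Bench evaluation code: the `Turn` structure, the `model(frames, question)` callable, and exact-match scoring are all hypothetical choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    """One exploration step (hypothetical structure, not the OST-Bench schema)."""
    new_frames: List[str]  # placeholder identifiers for newly observed frames
    question: str          # question asked after these frames are observed
    answer: str            # ground-truth answer


# Assumed model interface: (all frames seen so far, question) -> answer text.
Model = Callable[[List[str], str], str]


def evaluate_online(turns: Sequence[Turn], model: Model) -> List[bool]:
    """Score one episode turn by turn; memory only ever grows."""
    memory: List[str] = []
    results: List[bool] = []
    for turn in turns:
        # Observations arrive incrementally and accumulate across turns.
        memory.extend(turn.new_frames)
        prediction = model(list(memory), turn.question)
        # Exact-match scoring is an assumption made for this sketch.
        results.append(prediction.strip().lower() == turn.answer.strip().lower())
    return results


def accuracy_by_turn(episodes: Sequence[Sequence[Turn]], model: Model) -> List[float]:
    """Average correctness at each turn index across episodes."""
    totals: List[int] = []
    counts: List[int] = []
    for turns in episodes:
        for i, ok in enumerate(evaluate_online(turns, model)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += int(ok)
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Plotting the list returned by `accuracy_by_turn` against the turn index would be one way to visualize the decline in accuracy that the abstract reports as memory grows.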
