OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal Large Language Models

Abstract

Multimodal agents in robotics, augmented reality (AR), and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence that lies outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involved 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation time the model sees only the video prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source multimodal large language models (MLLMs), Gemini-3.1-Pro trails human experts by 27 points (59.2 vs. 86.6), with allocentric mapping as the dominant bottleneck. Notably, MLLMs fine-tuned for streaming or spatial tasks underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when it is not grounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
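The prefix-only evaluation rule can be illustrated with a minimal sketch. It is not the benchmark's actual code. The record layout (`StreamingQuestion`), the helper names, and the choice to treat the prefix as frames strictly before the query timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class StreamingQuestion:
    """Hypothetical record layout; the benchmark's real schema may differ."""
    video_id: str
    question: str
    query_time_s: float                       # when the question is asked
    evidence_interval_s: Tuple[float, float]  # (start, end) of supporting evidence


def visible_prefix(frame_times_s: Sequence[float], query_time_s: float) -> List[int]:
    """Indices of frames the model may see (assumed: strictly before the query)."""
    return [i for i, t in enumerate(frame_times_s) if t < query_time_s]


def evidence_in_recent_window(q: StreamingQuestion, window_s: float) -> bool:
    """True if the evidence overlaps the last `window_s` seconds before the query."""
    start, end = q.evidence_interval_s
    return end > q.query_time_s - window_s and start < q.query_time_s


# Illustrative values only, not taken from the benchmark.
q = StreamingQuestion("vid_001", "Where is the door relative to the sofa?", 12.0, (3.0, 5.0))
frames = [0.0, 4.0, 8.0, 12.0, 16.0]
print(visible_prefix(frames, q.query_time_s))       # [0, 1, 2]
print(evidence_in_recent_window(q, window_s=2.0))   # False
```

In this toy case the evidence ends seven seconds before the query, so a model that attends only to the most recent frames would miss it. This is the situation the abstract describes as using evidence outside the current view.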