OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Abstract

Multimodal agents in robotics, augmented reality (AR), and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence that lies outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation time the model sees only the video prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27.4 points (59.2 vs. 86.6), with allocentric mapping as the dominant bottleneck. Notably, streaming MLLMs and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when it is not grounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
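The prefix-only protocol described above (each question has a query timestamp and an evidence interval, and the model sees only what precedes the query) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `StreamingQuestion` record, its field names, the helper functions, the sample question, and the choice to treat frames with timestamps at or before the query time as visible are hypothetical and are not taken from the benchmark's actual data format or evaluation code.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingQuestion:
    # Hypothetical record layout; field names are illustrative only.
    question: str
    query_time: float      # seconds into the video at which the question is asked
    evidence_start: float  # start of the interval containing the needed evidence
    evidence_end: float    # end of that interval


def visible_prefix(frame_times, query_time):
    """Return the timestamps of frames the model may see.

    Assumes frame_times is sorted ascending and that a frame is visible
    when its timestamp is <= query_time (an assumption of this sketch).
    """
    return frame_times[:bisect_right(frame_times, query_time)]


def evidence_outside_window(q, window):
    """True if the evidence ended more than `window` seconds before the query,
    i.e. the answer cannot be read off the most recent frames alone."""
    return q.query_time - q.evidence_end > window


# Toy stream sampled at one frame per second.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q = StreamingQuestion(
    question="Which room is behind the kitchen?",  # made-up example question
    query_time=3.5,
    evidence_start=0.0,
    evidence_end=1.0,
)

prefix = visible_prefix(frame_times, q.query_time)
needs_memory = evidence_outside_window(q, window=2.0)
```

In this toy case the model would receive only the frames up to 3.0 s, and the evidence ended 2.5 s before the query, so answering requires retaining information from earlier in the stream rather than reading it off the current view.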