OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence that lies outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involved 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval; at evaluation time, the model sees only the video prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by roughly 27 points (59.2 vs. 86.6), with allocentric mapping as the dominant bottleneck. Notably, MLLMs fine-tuned for streaming or spatial tasks underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when it is not grounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
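
The prefix-only protocol described above can be illustrated with a minimal sketch. It is not the benchmark's actual data schema or evaluation code: the record type `StreamingQuestion`, its field names, the helper `visible_prefix`, the sample question, and the 1 fps timestamps are all hypothetical, and treating the frame exactly at the query timestamp as visible is an assumption of this sketch.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingQuestion:
    """Hypothetical record; field names are illustrative, not the benchmark's schema."""
    question: str
    query_time: float               # seconds into the video at which the question is asked
    evidence: tuple[float, float]   # (start, end) of the annotated evidence interval, in seconds


def visible_prefix(frame_times: list[float], query_time: float) -> list[int]:
    """Return indices of frames the model may see: those at or before query_time.

    Assumes frame_times is sorted in ascending order. Counting the frame that
    falls exactly on the query timestamp as visible is an assumption.
    """
    return list(range(bisect_right(frame_times, query_time)))


# Made-up example: a 6-second clip sampled at 1 frame per second.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q = StreamingQuestion(
    question="Where is the door relative to the sofa?",  # illustrative only
    query_time=3.5,
    evidence=(1.0, 2.0),
)

visible = visible_prefix(frame_times, q.query_time)
print(visible)  # frames at 4.0 s and 5.0 s are withheld from the model
```

In this example the evidence interval lies earlier in the prefix than the query, which mirrors the abstract's point that answers may depend on evidence no longer in the current view.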