VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Abstract

Recent video multimodal large language models (MLLMs) achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both a correct answer and accurate spatio-temporal localization are required (Level-5), and most models fail to produce a single correctly grounded prediction. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
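As an illustration of what a joint check in the spirit of Level-5 might look like, the sketch below counts a prediction as correct only if the answer is right and both the temporal interval and the bounding box overlap the annotated evidence sufficiently. The abstract does not state the matching metric or thresholds; the use of intersection-over-union (IoU), the 0.5 thresholds, and all function names here are assumptions made for illustration, not the benchmark's actual protocol.

```python
from typing import Tuple

Interval = Tuple[float, float]            # (start_sec, end_sec)
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def temporal_iou(a: Interval, b: Interval) -> float:
    """IoU of two time intervals; 0.0 if they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 if they do not overlap."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(
    answer_ok: bool,
    pred_span: Interval,
    gt_span: Interval,
    pred_box: Box,
    gt_box: Box,
    t_thresh: float = 0.5,  # assumed threshold, not from the paper
    s_thresh: float = 0.5,  # assumed threshold, not from the paper
) -> bool:
    """Hypothetical joint criterion: right answer AND both groundings match."""
    return (
        answer_ok
        and temporal_iou(pred_span, gt_span) >= t_thresh
        and box_iou(pred_box, gt_box) >= s_thresh
    )
```

Under a conjunctive criterion of this kind, a model that answers correctly but points at the wrong moment or region scores zero on the item, which is consistent with the sharp drop the abstract reports once grounding constraints are imposed.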
