SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Abstract

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that uses team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10 to 22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and ice hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% accuracy on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.