SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
May 29, 2026
Authors: Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius
cs.AI
Abstract
True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.
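The "capability cliff" described above is a statement about accuracy falling from one pillar of the hierarchy to the next. A minimal sketch of how such a per-pillar breakdown could be computed is given below. The four pillar names are taken from the abstract; the record format, the function names `accuracy_by_pillar` and `is_monotone_decline`, and the toy records are assumptions made for illustration and are not the benchmark's actual evaluation code or results.

```python
from collections import defaultdict

# Pillar names come from the abstract; the list order follows the stated hierarchy.
PILLARS = [
    "Dynamic Scene Understanding",
    "Causal Reasoning",
    "Strategic Simulation",
    "Agentic Synthesis",
]


def accuracy_by_pillar(records):
    """Return {pillar: fraction correct} for an iterable of (pillar, is_correct) pairs.

    The (pillar, is_correct) record format is a hypothetical one chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pillar, is_correct in records:
        total[pillar] += 1
        correct[pillar] += int(is_correct)
    # Only report pillars that actually have at least one record.
    return {p: correct[p] / total[p] for p in PILLARS if total[p]}


def is_monotone_decline(acc):
    """True if accuracy never increases when moving to the next pillar."""
    values = [acc[p] for p in PILLARS if p in acc]
    return all(a >= b for a, b in zip(values, values[1:]))


# Toy, made-up records used only to exercise the functions (not results from the paper).
toy_records = [
    ("Dynamic Scene Understanding", True),
    ("Dynamic Scene Understanding", True),
    ("Dynamic Scene Understanding", False),
    ("Dynamic Scene Understanding", True),
    ("Causal Reasoning", True),
    ("Causal Reasoning", False),
    ("Strategic Simulation", True),
    ("Strategic Simulation", False),
    ("Strategic Simulation", False),
    ("Strategic Simulation", False),
    ("Agentic Synthesis", False),
    ("Agentic Synthesis", False),
]

toy_acc = accuracy_by_pillar(toy_records)
```

Reporting accuracy per pillar rather than as a single pooled number is what makes a drop between cognitive levels visible; a pooled average over all tasks would hide it.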