SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Abstract

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.