A Very Big Video Reasoning Suite

Abstract

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/.
