RISE-Video: Can Video Generators Decode Implicit World Rules?

February 5, 2026
Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
cs.AI

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.