RISE-Video: Can Video Generators Decode Implicit World Rules?
February 5, 2026
Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
cs.AI
Abstract
While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.