RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
December 2, 2025
Authors: Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu
cs.AI
Abstract
Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks focus primarily on factors related to visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have made preliminary explorations of whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms, text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories, with 622 high-quality annotated instances. To evaluate each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to score each question, achieving 85% alignment with human judgments. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule-coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insights obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
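The checklist-based evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the metric names and checklist questions are invented for the example, and the judge is a stub standing in for an LLM judge (the paper uses GPT-o3 to answer each checklist question).

```python
from typing import Callable

def score_video(checklist: dict[str, list[str]],
                judge: Callable[[str], bool]) -> dict[str, float]:
    """Score one generated video: for each metric, ask the judge every
    checklist question and report the fraction answered 'yes'."""
    scores = {}
    for metric, questions in checklist.items():
        passed = sum(judge(q) for q in questions)
        scores[metric] = passed / len(questions)
    return scores

# Illustrative checklist with two of the four metrics (names are assumptions,
# not the paper's actual metric definitions).
checklist = {
    "rule_coherence": [
        "Does the ball follow the stated bouncing rule?",
        "Is the rule applied consistently across frames?",
    ],
    "instruction_adherence": [
        "Does the video depict the prompted objects?",
    ],
}

# Stub judge: a real system would send the question (plus video frames)
# to an LLM judge and parse its yes/no answer.
stub_judge = lambda q: "rule" in q.lower()

print(score_video(checklist, stub_judge))
```

Averaging such per-video metric scores over all 622 instances would yield benchmark-level numbers like the 48.87% rule-coherence figure reported above.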