RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

December 2, 2025
Authors: Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu
cs.AI

Abstract

Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms, text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories, with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insights obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
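To make the checklist-plus-LLM-judge protocol described in the abstract concrete, here is a minimal sketch of how per-metric scoring over yes/no checklist questions might look. Everything in it (the `score_video` helper, the toy checklist questions, and the stand-in `judge` callable) is an illustrative assumption, not RULER-Bench's actual implementation; in the paper the judge role is played by GPT-o3.

```python
# Illustrative sketch: score a generated video against a per-metric checklist
# of yes/no questions using a pluggable judge. Placeholder logic only.
from typing import Callable, Dict, List


def score_video(
    video_description: str,
    checklist: Dict[str, List[str]],  # metric name -> list of yes/no questions
    judge: Callable[[str], bool],     # returns True when the judge answers "yes"
) -> Dict[str, float]:
    """Return the fraction of checklist questions answered 'yes' for each metric."""
    scores: Dict[str, float] = {}
    for metric, questions in checklist.items():
        answers = [
            judge(
                f"Video: {video_description}\n"
                f"Question: {q}\n"
                f"Answer strictly yes or no."
            )
            for q in questions
        ]
        scores[metric] = sum(answers) / len(answers) if answers else 0.0
    return scores


if __name__ == "__main__":
    # Toy usage with a trivial keyword-matching stand-in judge; a real setup
    # would wrap an LLM call and parse its yes/no verdict.
    checklist = {
        "rule_coherence": ["Does the ball obey gravity throughout the clip?"],
        "instruction_adherence": ["Does the video show a red ball?"],
    }
    print(
        score_video(
            "a red ball falling onto a table",
            checklist,
            judge=lambda prompt: "red ball" in prompt,
        )
    )
```

Averaging binary checklist answers per metric is one simple way such per-question scores could be aggregated; the paper's exact rubric and aggregation may differ.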