EvalVerse: A Pipeline-Aware and Expert-Calibrated Benchmark for Professional Cinematic Video Generation

Abstract

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To meet this quality bar, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling them to perform explicit Chain-of-Thought reasoning. Compared with prior work, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to cover "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse goes beyond a static leaderboard and establishes foundational infrastructure for future work, such as reward models and evaluator agents.