EvalVerse: A Pipeline-Aware and Expert-Calibrated Benchmark for Professional-Grade Cinematic Video Generation

Abstract

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To meet such demanding quality standards, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling them to perform explicit Chain-of-Thought reasoning. Compared to previous work, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse goes beyond a static leaderboard and establishes foundational infrastructure for future work, such as reward models and evaluator agents.