EvalVerse: Pipeline-Aware, Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Abstract

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous work, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse goes beyond a static leaderboard and establishes foundational infrastructure for future work, such as reward models and evaluator agents.