VideoScore2: Think before You Score in Generative Video Evaluation

Abstract

Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge because quality is multi-faceted, spanning visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to a single opaque score, lack interpretability, or provide only coarse analysis, so they fail to capture the full scope of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. The model is trained on VideoFeedback2, a large-scale dataset of 27,168 human-annotated videos with both scores and reasoning traces across the three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments show that VideoScore2 achieves 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.). Its interpretable assessments also bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
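To make the Best-of-N use case concrete, the following is a minimal sketch of how a three-dimensional scorer could select one video from N candidates. It is not the VideoScore2 API: `VideoScores`, `best_of_n`, and the toy score table are hypothetical names introduced for illustration, and the unweighted mean used to combine the three dimensions is an assumption, since the abstract does not state how the dimensions are aggregated.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class VideoScores:
    """Hypothetical container for the three dimensions named in the abstract."""
    visual_quality: float
    text_alignment: float
    physical_consistency: float

    def aggregate(self) -> float:
        # Assumption: unweighted mean of the three dimensions.
        return mean((self.visual_quality,
                     self.text_alignment,
                     self.physical_consistency))


def best_of_n(candidates: Sequence[str],
              score_fn: Callable[[str], VideoScores]) -> str:
    """Return the candidate whose aggregate score is highest."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda c: score_fn(c).aggregate())


# Toy stand-in for a real scorer: a lookup table of made-up scores.
toy_scores = {
    "a.mp4": VideoScores(3, 3, 3),
    "b.mp4": VideoScores(5, 4, 2),
    "c.mp4": VideoScores(4, 4, 4),
}

print(best_of_n(list(toy_scores), toy_scores.__getitem__))  # c.mp4
```

In practice `score_fn` would wrap a call to the evaluator for each generated video; a weighted aggregate could replace the mean if one dimension matters more for a given application.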
