VideoScore2: Think before You Score in Generative Video Evaluation
September 26, 2025
作者: Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen
cs.AI
Abstract
Recent advances in text-to-video generation have produced increasingly
realistic and diverse content, yet evaluating such videos remains a fundamental
challenge due to their multi-faceted nature encompassing visual quality,
semantic alignment, and physical consistency. Existing evaluators and reward
models are limited to single opaque scores, lack interpretability, or provide
only coarse analysis, making them insufficient for capturing the comprehensive
nature of video quality assessment. We present VideoScore2, a
multi-dimensional, interpretable, and human-aligned framework that explicitly
evaluates visual quality, text-to-video alignment, and physical/common-sense
consistency while producing detailed chain-of-thought rationales. Our model is
trained on a large-scale dataset VideoFeedback2 containing 27,168
human-annotated videos with both scores and reasoning traces across three
dimensions, using a two-stage pipeline of supervised fine-tuning followed by
reinforcement learning with Group Relative Policy Optimization (GRPO) to
enhance analytical robustness. Extensive experiments demonstrate that
VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our
in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance
across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.),
while providing interpretable assessments that bridge the gap between
evaluation and controllable generation through effective reward modeling for
Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
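The Best-of-N sampling mentioned above can be sketched as follows: generate N candidate videos for a prompt, score each with a reward model, and keep the highest-scoring candidate. This is a minimal illustration, not the VideoScore2 API; `generate_video` and `score_video` are hypothetical placeholders for a video generator and an evaluator that returns a scalar reward.

```python
def best_of_n(prompt, generate_video, score_video, n=4):
    """Return the candidate with the highest reward and its score.

    generate_video(prompt) -> candidate video (any object)
    score_video(prompt, video) -> float reward
    """
    # Sample N candidates independently from the generator.
    candidates = [generate_video(prompt) for _ in range(n)]
    # Score each candidate with the reward model.
    scores = [score_video(prompt, c) for c in candidates]
    # Select the candidate with the maximum reward.
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```

In practice the scalar reward could be one of VideoScore2's dimension scores (or an aggregate of visual quality, alignment, and physical consistency), so that interpretable evaluation directly steers generation.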