

VideoScore2: Think before You Score in Generative Video Evaluation

September 26, 2025
作者: Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen
cs.AI

Abstract

Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature, encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset, VideoFeedback2, containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance, with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
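The Best-of-N sampling mentioned above can be sketched in a few lines: generate N candidate videos for a prompt and keep the one the reward model scores highest. This is a minimal illustration, not the paper's actual API; `generate_video` and `score_video` are hypothetical placeholders standing in for a video generator and a scorer such as VideoScore2.

```python
def best_of_n(prompt, generate_video, score_video, n=4):
    """Generate n candidate videos for `prompt` and return the candidate
    that the reward model (score_video) ranks highest.

    `generate_video` and `score_video` are placeholder callables, not the
    paper's real interface: generate_video(prompt) -> video, and
    score_video(video) -> scalar reward (e.g., an average over the three
    evaluated dimensions: visual quality, alignment, physical consistency).
    """
    candidates = [generate_video(prompt) for _ in range(n)]
    return max(candidates, key=score_video)


if __name__ == "__main__":
    # Toy usage with stubs: four fake "videos" and fixed scores.
    videos = iter(["vid_a", "vid_b", "vid_c", "vid_d"])
    scores = {"vid_a": 0.2, "vid_b": 0.9, "vid_c": 0.5, "vid_d": 0.4}
    best = best_of_n("a cat surfing", lambda p: next(videos), scores.get)
    print(best)  # vid_b, the highest-scored candidate
```

In practice the scorer is the expensive part, so N trades extra generation and scoring compute for output quality; the abstract reports that using VideoScore2 as the reward model in this loop improves controllable generation.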