UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
March 13, 2025
Authors: Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
cs.AI
Abstract
With the rapid growth of video generative models (VGMs), it is essential to
develop reliable and comprehensive automatic metrics for AI-generated videos
(AIGVs). Existing methods either use off-the-shelf models optimized for other
tasks or rely on human assessment data to train specialized evaluators. These
approaches are constrained to specific evaluation aspects and are difficult to
scale with the increasing demands for finer-grained and more comprehensive
evaluations. To address this issue, this work investigates the feasibility of
using multimodal large language models (MLLMs) as unified evaluators for
AIGVs, leveraging their strong visual perception and language understanding
capabilities. To evaluate the performance of automatic metrics in unified AIGV
evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects
videos generated by state-of-the-art VGMs and provides pairwise human
preference annotations across 15 evaluation aspects. Using UVE-Bench, we
extensively evaluate 16 MLLMs. Our empirical results suggest that while
advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human
evaluators, they demonstrate promising ability in unified AIGV evaluation,
significantly surpassing existing specialized evaluation methods. Additionally,
we conduct an in-depth analysis of key design choices that impact the
performance of MLLM-driven evaluators, offering valuable insights for future
research on AIGV evaluation. The code is available at
https://github.com/bytedance/UVE.
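
To make the pairwise evaluation protocol that UVE-Bench supports concrete, the sketch below shows how an MLLM judge's choices could be scored against human preference annotations. This is a minimal illustration under stated assumptions, not the authors' released code: the query_mllm callable stands in for an arbitrary MLLM inference backend, and the record fields (video_a, video_b, aspect, human_preference) are hypothetical names chosen for the example.

    # Minimal sketch of pairwise AIGV evaluation with an MLLM judge.
    # Assumptions (not from the paper): query_mllm wraps an arbitrary
    # MLLM inference call taking two video paths and a text prompt;
    # each benchmark record carries an evaluation aspect and a human
    # preference label in {"A", "B", "T"}.

    from typing import Callable

    PROMPT = (
        "You are shown two AI-generated videos, A and B. "
        "Judge which one is better in terms of: {aspect}. "
        "Answer with a single letter: A, B, or T (tie)."
    )

    def pairwise_accuracy(
        examples: list[dict],
        query_mllm: Callable[[str, str, str], str],
    ) -> float:
        """Fraction of pairs where the MLLM's choice matches the human label."""
        if not examples:
            return 0.0
        correct = 0
        for ex in examples:
            prompt = PROMPT.format(aspect=ex["aspect"])
            answer = query_mllm(ex["video_a"], ex["video_b"], prompt)
            # Take the first A/B/T token the model emits; default to a tie.
            choice = next((c for c in answer.upper() if c in "ABT"), "T")
            correct += int(choice == ex["human_preference"])
        return correct / len(examples)

Agreement with pairwise human labels of this kind is one natural way to quantify how closely an automatic evaluator tracks human judgment across the 15 evaluation aspects.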