

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

March 13, 2025
作者: Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
cs.AI

Abstract
With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at https://github.com/bytedance/UVE.
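The evaluation protocol described above scores automatic metrics by their agreement with pairwise human preferences across evaluation aspects. A minimal sketch of that scoring logic is below; the function names and data layout are illustrative assumptions, not the actual UVE-Bench API.

```python
# Hypothetical sketch of pairwise-preference agreement scoring, in the
# spirit of UVE-Bench's protocol. Each preference is "A", "B", or "tie".

def preference_accuracy(human_prefs, model_prefs):
    """Fraction of video pairs where the automatic evaluator's preferred
    video matches the human annotation."""
    if len(human_prefs) != len(model_prefs):
        raise ValueError("annotation lists must be the same length")
    matches = sum(h == m for h, m in zip(human_prefs, model_prefs))
    return matches / len(human_prefs)

def per_aspect_accuracy(records):
    """Aggregate agreement per evaluation aspect.

    records: list of dicts with 'aspect', 'human', and 'model' keys,
    one per annotated video pair (layout is assumed for illustration).
    """
    by_aspect = {}
    for r in records:
        by_aspect.setdefault(r["aspect"], []).append(r["human"] == r["model"])
    return {aspect: sum(hits) / len(hits) for aspect, hits in by_aspect.items()}
```

With annotations spanning 15 aspects, the per-aspect breakdown is what separates a unified evaluator from one that only excels on a few dimensions.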


PDF · March 21, 2025