
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

May 29, 2025
作者: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
cs.AI

Abstract
Multimodal large language models (MLLMs) have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks, highlighting the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
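The abstract names four evaluation tasks; as a minimal sketch (not the benchmark's actual data format or scoring code, which the paper defines), per-task accuracy over a set of model predictions could be aggregated like this, with the record layout being a hypothetical assumption:

```python
# Hypothetical sketch of aggregating per-task scores for the four
# VF-Eval tasks named in the abstract. The record format (dicts with
# 'task', 'pred', 'gold') is illustrative, not the benchmark's own.

TASKS = [
    "coherence_validation",
    "error_awareness",
    "error_type_detection",
    "reasoning_evaluation",
]

def per_task_accuracy(records):
    """Compute accuracy per task from (task, pred, gold) records."""
    totals = {t: 0 for t in TASKS}
    correct = {t: 0 for t in TASKS}
    for r in records:
        totals[r["task"]] += 1
        correct[r["task"]] += int(r["pred"] == r["gold"])
    # Tasks with no records score 0.0 rather than dividing by zero.
    return {t: (correct[t] / totals[t]) if totals[t] else 0.0
            for t in TASKS}
```

A model that "struggles to achieve consistently good performance across all tasks" would show a large spread between its best and worst entries in the returned dictionary.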
