VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
May 29, 2025
Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
cs.AI
Abstract
Multimodal large language models (MLLMs) have recently been widely studied for video question answering.
However, most existing assessments focus on natural videos, overlooking
synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in
video generation rely on MLLMs to evaluate the quality of generated videos, but
the capabilities of MLLMs in interpreting AIGC videos remain largely
underexplored. To address this, we propose a new benchmark, VF-Eval, which
introduces four tasks (coherence validation, error awareness, error type
detection, and reasoning evaluation) to comprehensively evaluate the abilities
of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that
even the best-performing model, GPT-4.1, struggles to achieve consistently good
performance across all tasks. This highlights the challenging nature of our
benchmark. Additionally, to investigate the practical applications of VF-Eval
in improving video generation, we conduct an experiment, RePrompt,
demonstrating that aligning MLLMs more closely with human feedback can benefit
video generation.
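To make the evaluation setup concrete, below is a minimal sketch of a per-task scoring harness for a benchmark of this kind. The JSONL record layout, the task field names, and the `query_mllm` stub are assumptions made for illustration; the released VF-Eval data format and protocol may differ.

```python
# Minimal sketch of a per-task scoring loop for a VF-Eval-style benchmark.
# The JSONL record layout and the query_mllm() stub below are illustrative
# assumptions, not the released VF-Eval format or API.
import json
import random
from collections import defaultdict

TASKS = [
    "coherence_validation",
    "error_awareness",
    "error_type_detection",
    "reasoning_evaluation",
]

def query_mllm(video_path: str, question: str, choices: list[str]) -> str:
    """Stand-in for a multimodal LLM call.

    A real harness would upload the video (or sampled frames) together
    with the question and return the model's answer; here we pick a
    choice at random so the loop runs end to end as a chance-level
    baseline.
    """
    return random.choice(choices)

def evaluate(path: str) -> dict[str, float]:
    """Compute per-task accuracy over a JSONL file of examples.

    Each line is assumed to look like:
      {"task": "error_awareness", "video": "clip_001.mp4",
       "question": "...", "choices": ["A", "B", "C", "D"], "answer": "B"}
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            if ex["task"] not in TASKS:
                continue
            pred = query_mllm(ex["video"], ex["question"], ex["choices"])
            total[ex["task"]] += 1
            correct[ex["task"]] += int(pred.strip() == ex["answer"])
    return {t: correct[t] / total[t] for t in TASKS if total[t] > 0}
```

Reporting accuracy separately per task, as sketched here, is what exposes the pattern the paper observes: even a strong model such as GPT-4.1 can do well on some tasks while lagging on others.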