VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
May 29, 2025
Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
cs.AI
Abstract
Multimodal large language models (MLLMs) have recently been widely studied for video question answering.
However, most existing assessments focus on natural videos, overlooking
synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in
video generation rely on MLLMs to evaluate the quality of generated videos, but
the capabilities of MLLMs in interpreting AIGC videos remain largely
underexplored. To address this, we propose a new benchmark, VF-Eval, which
introduces four tasks (coherence validation, error awareness, error type
detection, and reasoning evaluation) to comprehensively evaluate the abilities
of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that
even the best-performing model, GPT-4.1, struggles to achieve consistently good
performance across all tasks. This highlights the challenging nature of our
benchmark. Additionally, to investigate the practical applications of VF-Eval
in improving video generation, we conduct an experiment, RePrompt,
demonstrating that aligning MLLMs more closely with human feedback can benefit
video generation.
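To make the evaluation setup concrete, below is a minimal sketch of a per-task scoring harness for a benchmark of this kind. The JSONL record layout, the task field names, and the `query_mllm` stub are assumptions made for illustration; the released VF-Eval data format and protocol may differ.

```python
# Minimal sketch of a per-task scoring loop for a VF-Eval-style benchmark.
# The JSONL record layout and the query_mllm() stub below are illustrative
# assumptions, not the released VF-Eval format or API.
import json
import random
from collections import defaultdict

TASKS = [
    "coherence_validation",
    "error_awareness",
    "error_type_detection",
    "reasoning_evaluation",
]

def query_mllm(video_path: str, question: str, choices: list[str]) -> str:
    """Stand-in for a multimodal LLM call.

    A real harness would upload the video (or sampled frames) together
    with the question and return the model's answer; here we pick a
    choice at random so the loop runs end to end as a chance-level
    baseline.
    """
    return random.choice(choices)

def evaluate(path: str) -> dict[str, float]:
    """Compute per-task accuracy over a JSONL file of examples.

    Each line is assumed to look like:
      {"task": "error_awareness", "video": "clip_001.mp4",
       "question": "...", "choices": ["A", "B", "C", "D"], "answer": "B"}
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            if ex["task"] not in TASKS:
                continue
            pred = query_mllm(ex["video"], ex["question"], ex["choices"])
            total[ex["task"]] += 1
            correct[ex["task"]] += int(pred.strip() == ex["answer"])
    return {t: correct[t] / total[t] for t in TASKS if total[t] > 0}
```

Reporting accuracy separately per task, as sketched here, is what exposes the pattern the paper observes: even a strong model such as GPT-4.1 can do well on some tasks while lagging on others.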