VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Abstract

Multimodal large language models (MLLMs) have recently been widely studied for video question answering. However, most existing assessments focus on natural videos and overlook synthetic videos such as AI-generated content (AIGC). Meanwhile, some work in video generation relies on MLLMs to evaluate the quality of generated videos, but the ability of MLLMs to interpret AIGC videos remains largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks, which highlights the challenging nature of the benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
