VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Abstract

Multimodal large language models (MLLMs) have recently been widely studied for video question answering. However, most existing assessments focus on natural videos and overlook synthetic videos such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

