ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
March 21, 2025
Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song
cs.AI
Abstract
Precisely evaluating semantic alignment between text prompts and generated
videos remains a challenge in Text-to-Video (T2V) Generation. Existing
text-to-video alignment metrics like CLIPScore only generate coarse-grained
scores without fine-grained alignment details, failing to align with human
preference. To address this limitation, we propose ETVA, a novel Evaluation
method of Text-to-Video Alignment via fine-grained question generation and
answering. First, a multi-agent system parses prompts into semantic scene
graphs to generate atomic questions. Then, we design a knowledge-augmented
multi-stage reasoning framework for question answering, where an auxiliary LLM
first retrieves relevant common-sense knowledge (e.g., physical laws), and then
a video LLM answers the generated questions through a multi-stage reasoning
mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's
correlation coefficient of 58.47, showing a much higher correlation with human
judgment than existing metrics, which attain only 31.0. We also construct a
comprehensive benchmark specifically designed for text-to-video alignment
evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10
categories. Through a systematic evaluation of 15 existing text-to-video
models, we identify their key capabilities and limitations, paving the way for
next-generation T2V generation.
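
The abstract measures agreement with human judgment using Spearman's correlation coefficient. As a rough illustration of what that statistic computes, the sketch below ranks two score lists and takes the Pearson correlation of the ranks. It is not the authors' evaluation code: the function names and the sample scores are hypothetical, and reading the reported 58.47 as a coefficient scaled by 100 is an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks of xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: one automatic alignment score and one human rating per video.
    metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
    human_ratings = [5, 2, 3, 1, 4]
    rho = spearman(metric_scores, human_ratings)
    print(f"Spearman's rho: {rho:.2f} (x100: {100 * rho:.2f})")
```

Because the statistic depends only on rank order, a metric is rewarded for ordering videos the way human raters do, regardless of the scale of its raw scores.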