ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

March 21, 2025
Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song
cs.AI

Abstract

Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in text-to-video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore produce only coarse-grained scores without fine-grained alignment details, and they fail to align with human preference. To address this limitation, we propose ETVA, a novel evaluation method for text-to-video alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation.
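To make the evaluation protocol concrete, here is a minimal sketch of how per-question answers might be turned into a video-level score and then compared with human ratings using Spearman's rank correlation. It is not the authors' implementation. The `alignment_score` aggregation (fraction of atomic questions answered "yes") is an assumption, since the abstract does not state how answers are combined, and the toy data is invented for illustration. Because Spearman's coefficient lies in [-1, 1], the reported 58.47 is presumably the coefficient multiplied by 100.

```python
from statistics import mean


def alignment_score(answers):
    """Hypothetical aggregation (an assumption, not from the paper):
    the fraction of atomic questions answered 'yes' for one video."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(answers) / len(answers)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy data (invented): yes/no answers per video and matching human ratings.
answers_per_video = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, False, False],
]
human_ratings = [4, 2, 5, 1]

metric_scores = [alignment_score(a) for a in answers_per_video]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.2f} ({100 * rho:.2f} on a 0-100 scale)")
```

Rank correlation suits this comparison because it asks only whether the metric orders videos the way human raters do, not whether the two sets of scores share a scale.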
