ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Abstract

Precisely evaluating the semantic alignment between text prompts and generated videos remains a challenge in text-to-video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore produce only coarse-grained scores without fine-grained alignment details, and they fail to align with human preference. To address this limitation, we propose ETVA, a novel evaluation method for text-to-video alignment based on fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then, we design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for the next generation of T2V models.
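
The agreement with human judgment reported above is measured with Spearman's rank correlation. As a minimal sketch of how such a figure can be computed, the Python below ranks a list of metric scores and a list of human ratings (averaging the ranks of tied values) and takes the Pearson correlation of the two rank lists. The function names and the sample data are hypothetical and are not taken from the ETVA code or benchmark. The coefficient itself lies in [-1, 1], so values such as 58.47 are presumably reported after multiplying by 100; that scaling is an assumption here.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every entry tied with the one at `start`.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical data: one alignment score and one human rating per video.
    metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    human_ratings = [2.0, 1.0, 4.0, 3.0, 5.0]
    rho = spearman(metric_scores, human_ratings)
    # Scaled by 100 under the assumption stated in the lead-in.
    print(f"Spearman's rho x 100: {100 * rho:.2f}")
```

On the sample data the two rankings disagree only by two swaps of adjacent items, so the coefficient comes out at about 0.8. Because only ranks matter, a metric does not need to match the human rating scale to score well; it only needs to order the videos the same way.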