ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Abstract

Precisely evaluating the semantic alignment between text prompts and generated videos remains a challenge in text-to-video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore produce only coarse-grained scores without fine-grained alignment details, and they align poorly with human preference. To address this limitation, we propose ETVA, a novel evaluation method for text-to-video alignment based on fine-grained question generation and answering. First, a multi-agent system parses each prompt into a semantic scene graph and generates atomic questions from it. We then design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws) and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47 with human judgment, a much higher correlation than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation.
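The agreement figures quoted above are Spearman rank correlations between a metric's scores and human ratings. The following is a minimal sketch of how such a coefficient can be computed, not the evaluation code used for ETVA. The function names and the toy data are hypothetical, and the factor of 100 assumes the reported values (58.47, 31.0) are correlations scaled to a percentage range, which the abstract does not state explicitly.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one metric score and one human rating per video.
metric_scores = [0.9, 0.4, 0.7, 0.2]
human_ratings = [4, 3, 5, 1]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman's rho (x100): {100 * rho:.2f}")
```

Because only ranks enter the computation, the coefficient rewards a metric for ordering videos the way human annotators do, regardless of the absolute scale of its scores.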