ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
March 21, 2025
Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song
cs.AI
Abstract
Precisely evaluating semantic alignment between text prompts and generated
videos remains a challenge in Text-to-Video (T2V) Generation. Existing
text-to-video alignment metrics like CLIPScore only generate coarse-grained
scores without fine-grained alignment details, failing to align with human
preference. To address this limitation, we propose ETVA, a novel Evaluation
method of Text-to-Video Alignment via fine-grained question generation and
answering. First, a multi-agent system parses prompts into semantic scene
graphs to generate atomic questions. Then we design a knowledge-augmented
multi-stage reasoning framework for question answering, where an auxiliary LLM
first retrieves relevant common-sense knowledge (e.g., physical laws), and then
a video LLM answers the generated questions through a multi-stage reasoning
mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's
correlation coefficient of 58.47, showing a much higher correlation with human
judgment than existing metrics, which attain only 31.0. We also construct a
comprehensive benchmark specifically designed for text-to-video alignment
evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10
categories. Through a systematic evaluation of 15 existing text-to-video
models, we identify their key capabilities and limitations, paving the way for
next-generation T2V generation.
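
The abstract compares metrics by their Spearman correlation with human judgment. The sketch below shows how such a comparison could be computed, and it is not the authors' code. It assumes a video's alignment score is the fraction of atomic questions answered "yes", which the abstract does not state, and the names `alignment_score`, `average_ranks`, and `spearman` are hypothetical. The toy inputs are illustrative only and are not data from the paper.

```python
from typing import Sequence


def alignment_score(answers: Sequence[bool]) -> float:
    """Fraction of atomic questions answered 'yes'.

    ASSUMPTION: the abstract does not say how answers are aggregated;
    a simple mean is used here purely for illustration.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return sum(answers) / len(answers)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy per-video answers and human ratings (made up for illustration).
    per_video_answers = [
        [True, True, True, False],
        [True, False, False, False],
        [True, True, True, True],
    ]
    human_ratings = [4.0, 2.0, 5.0]
    metric_scores = [alignment_score(a) for a in per_video_answers]
    print(spearman(metric_scores, human_ratings))
```

Spearman's coefficient lies in [-1, 1]; figures such as 58.47 and 31.0 in the abstract are presumably reported on a scale multiplied by 100.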