Towards A Better Metric for Text-to-Video Generation
January 15, 2024
Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
cs.AI
Abstract
Generative models have demonstrated remarkable capability in synthesizing
high-quality text, images, and videos. For video generation, contemporary
text-to-video models exhibit impressive capabilities, crafting visually
stunning videos. Nonetheless, evaluating such videos poses significant
challenges. Current research predominantly employs automated metrics such as
FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis,
particularly in the temporal assessment of video content, thus rendering them
unreliable indicators of true video quality. Furthermore, while user studies
have the potential to reflect human perception accurately, they are hampered by
their time-intensive and laborious nature, with outcomes that are often tainted
by subjective bias. In this paper, we investigate the limitations inherent in
existing metrics and introduce a novel evaluation pipeline, the Text-to-Video
Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video
Alignment, which scrutinizes the fidelity of the video in representing the
given text description, and (2) Video Quality, which evaluates the video's
overall production caliber with a mixture of experts. Moreover, to evaluate the
proposed metrics and facilitate future improvements to them, we present the
TVGE dataset, collecting human judgments of 2,543 text-to-video generated
videos on the two criteria. Experiments on the TVGE dataset demonstrate the
superiority of the proposed T2VScore in offering a better metric for
text-to-video generation.
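The abstract's claim that current metrics fail at temporal assessment can be made concrete. The sketch below (not the paper's code) implements the common video adaptation of CLIP Score: embed each frame, embed the prompt, and average the per-frame cosine similarities. Because averaging is permutation-invariant, the same frames reversed or shuffled receive the same score. The embeddings here are random stand-ins; in practice they would come from a CLIP image/text encoder.

```python
import numpy as np

def frame_averaged_clip_score(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Video CLIP Score as commonly adapted: mean cosine similarity
    between each frame embedding (T, D) and the text embedding (D,)."""
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    return float((frames @ text).mean())

rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(16, 512))  # stand-in for 16 CLIP frame embeddings
text_emb = rng.normal(size=512)          # stand-in for the prompt's CLIP embedding

original = frame_averaged_clip_score(frame_embs, text_emb)
reversed_ = frame_averaged_clip_score(frame_embs[::-1], text_emb)
shuffled = frame_averaged_clip_score(rng.permutation(frame_embs, axis=0), text_emb)

# All three agree (up to floating-point summation order): the metric
# cannot tell a coherent clip from its frames played backwards or shuffled.
print(original, reversed_, shuffled)
```

This order-invariance is exactly the kind of temporal blind spot the abstract argues makes frame-based metrics unreliable indicators of video quality.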
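The abstract states that experiments on TVGE demonstrate T2VScore's superiority. A standard way to substantiate such a claim for any candidate metric is rank correlation between per-video metric scores and collected human ratings; the sketch below shows the usual Spearman/Kendall computation. All values are synthetic stand-ins, not the paper's evaluation code or results.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins: per-video scores from a candidate metric and the
# corresponding mean human ratings (e.g., 1-5 Likert, as in a TVGE-style study).
rng = np.random.default_rng(0)
human_ratings = rng.uniform(1, 5, size=200)
metric_scores = human_ratings + rng.normal(0, 0.7, size=200)  # noisy metric

rho, _ = stats.spearmanr(metric_scores, human_ratings)   # rank correlation
tau, _ = stats.kendalltau(metric_scores, human_ratings)  # pairwise-order agreement
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```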