TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
June 12, 2024
Authors: Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang
cs.AI
Abstract
Video generation has many unique challenges beyond those of image generation.
The temporal dimension introduces extensive possible variations across frames,
over which consistency and continuity may be violated. In this study, we move
beyond evaluating simple actions and argue that generated videos should
incorporate the emergence of new concepts and their relation transitions like
in real-world videos as time progresses. To assess the Temporal
Compositionality of video generation models, we propose TC-Bench, a benchmark
of meticulously crafted text prompts, corresponding ground truth videos, and
robust evaluation metrics. The prompts articulate the initial and final states
of scenes, effectively reducing ambiguities for frame development and
simplifying the assessment of transition completion. In addition, by collecting
aligned real-world videos corresponding to the prompts, we expand TC-Bench's
applicability from text-conditional models to image-conditional ones that can
perform generative frame interpolation. We also develop new metrics to measure
the completeness of component transitions in generated videos, which
demonstrate significantly higher correlations with human judgments than
existing metrics. Our comprehensive experimental results reveal that most video
generators achieve less than 20% of the compositional changes, highlighting
enormous space for future improvement. Our analysis indicates that current
video generation models struggle to interpret descriptions of compositional
changes and synthesize various components across different time steps.
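To make the idea of "measuring the completeness of component transitions" concrete, here is a minimal sketch. It assumes per-frame similarity scores between each frame and textual descriptions of the scene's initial and final states (e.g., from an off-the-shelf vision-language model); this is a hypothetical illustration of the general scoring principle, not the actual TC-Bench metric.

```python
def transition_completion_score(sim_initial, sim_final):
    """Hypothetical sketch: score how fully a video transitions between states.

    sim_initial[i]: similarity of frame i to the initial-state description.
    sim_final[i]:   similarity of frame i to the final-state description.
    Both lists cover the same frames, in temporal order.
    """
    # The first frame should favor the initial state over the final state.
    start_align = sim_initial[0] - sim_final[0]
    # The last frame should favor the final state over the initial state.
    end_align = sim_final[-1] - sim_initial[-1]
    # A transition counts as complete only if both ends are correctly
    # aligned; otherwise the score collapses to zero.
    return max(0.0, start_align) * max(0.0, end_align)


# A video that actually transitions scores above zero...
completed = transition_completion_score([0.9, 0.6, 0.3], [0.2, 0.5, 0.8])

# ...while a video stuck in the initial state scores zero.
stuck = transition_completion_score([0.9, 0.8, 0.7], [0.2, 0.3, 0.3])
```

Under this sketch, a generator that renders only the initial scene throughout (the common failure mode the abstract describes) receives no credit, regardless of per-frame quality.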