TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
June 12, 2024
Authors: Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang
cs.AI
Abstract
Video generation has many unique challenges beyond those of image generation.
The temporal dimension introduces extensive possible variations across frames,
over which consistency and continuity may be violated. In this study, we move
beyond evaluating simple actions and argue that generated videos should
incorporate the emergence of new concepts and transitions in their relations,
as real-world videos do over time. To assess the Temporal
Compositionality of video generation models, we propose TC-Bench, a benchmark
of meticulously crafted text prompts, corresponding ground truth videos, and
robust evaluation metrics. The prompts articulate the initial and final states
of scenes, effectively reducing ambiguities for frame development and
simplifying the assessment of transition completion. In addition, by collecting
aligned real-world videos corresponding to the prompts, we expand TC-Bench's
applicability from text-conditional models to image-conditional ones that can
perform generative frame interpolation. We also develop new metrics to measure
the completeness of component transitions in generated videos, which
demonstrate significantly higher correlations with human judgments than
existing metrics. Our comprehensive experimental results reveal that most video
generators achieve less than 20% of the compositional changes, highlighting
enormous space for future improvement. Our analysis indicates that current
video generation models struggle to interpret descriptions of compositional
changes and synthesize various components across different time steps.
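To make the idea of "completeness of component transitions" concrete, the sketch below scores whether a generated video's early frames align with the prompt's initial state and its late frames with the final state, using CLIP image-text similarity. This is a minimal illustrative proxy, not the TC-Bench metric; the model name, frame slicing, and scoring formula are assumptions for demonstration only.

```python
# Hypothetical sketch of a transition-completion proxy (NOT the TC-Bench metric).
# Assumes decoded video frames as PIL images and the openai/clip-vit-base-patch32 checkpoint.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_text_similarity(frames, text):
    """Cosine similarity between each frame and a text description."""
    inputs = processor(text=[text], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1)  # one score per frame

def transition_completion(frames, initial_desc, final_desc):
    """Crude proxy: early frames should match the initial-state description,
    late frames the final-state description; score the net shift in alignment."""
    sim_init = frame_text_similarity(frames, initial_desc)
    sim_final = frame_text_similarity(frames, final_desc)
    n = len(frames)
    early, late = slice(0, max(1, n // 4)), slice(max(1, 3 * n // 4), n)
    start_gap = (sim_init[early] - sim_final[early]).mean()  # initial state dominates early
    end_gap = (sim_final[late] - sim_init[late]).mean()      # final state dominates late
    return float(start_gap + end_gap)

# Example usage (hypothetical prompt pair describing initial and final states):
# frames = [Image.open(f"frame_{i:03d}.png") for i in range(16)]
# score = transition_completion(frames,
#                               "a green caterpillar on a leaf",
#                               "a butterfly resting on a leaf")
```

A higher score indicates that the video both starts near the initial state and ends near the final state; a video that never transitions, or that shows the final state from the first frame, scores low under this crude heuristic.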