TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
June 12, 2024
Authors: Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang
cs.AI
Abstract
Video generation has many unique challenges beyond those of image generation.
The temporal dimension introduces extensive possible variations across frames,
over which consistency and continuity may be violated. In this study, we move
beyond evaluating simple actions and argue that generated videos should
incorporate the emergence of new concepts and transitions in their relations,
as real-world videos do over time. To assess the Temporal
Compositionality of video generation models, we propose TC-Bench, a benchmark
of meticulously crafted text prompts, corresponding ground truth videos, and
robust evaluation metrics. The prompts articulate the initial and final states
of scenes, effectively reducing ambiguities for frame development and
simplifying the assessment of transition completion. In addition, by collecting
aligned real-world videos corresponding to the prompts, we expand TC-Bench's
applicability from text-conditional models to image-conditional ones that can
perform generative frame interpolation. We also develop new metrics to measure
the completeness of component transitions in generated videos, which
demonstrate significantly higher correlations with human judgments than
existing metrics. Our comprehensive experimental results reveal that most video
generators achieve less than 20% of the compositional changes, highlighting
enormous space for future improvement. Our analysis indicates that current
video generation models struggle to interpret descriptions of compositional
changes and synthesize various components across different time steps.
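To make the idea of "completeness of component transitions" concrete, the sketch below scores whether a generated video's early frames align with the prompt's initial state and its late frames with the final state, using CLIP image-text similarity. This is a minimal illustrative proxy, not the TC-Bench metric; the model name, frame slicing, and scoring formula are assumptions for demonstration only.

```python
# Hypothetical sketch of a transition-completion proxy (NOT the TC-Bench metric).
# Assumes decoded video frames as PIL images and the openai/clip-vit-base-patch32 checkpoint.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_text_similarity(frames, text):
    """Cosine similarity between each frame and a text description."""
    inputs = processor(text=[text], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1)  # one score per frame

def transition_completion(frames, initial_desc, final_desc):
    """Crude proxy: early frames should match the initial-state description,
    late frames the final-state description; score the net shift in alignment."""
    sim_init = frame_text_similarity(frames, initial_desc)
    sim_final = frame_text_similarity(frames, final_desc)
    n = len(frames)
    early, late = slice(0, max(1, n // 4)), slice(max(1, 3 * n // 4), n)
    start_gap = (sim_init[early] - sim_final[early]).mean()  # initial state dominates early
    end_gap = (sim_final[late] - sim_init[late]).mean()      # final state dominates late
    return float(start_gap + end_gap)

# Example usage (hypothetical prompt pair describing initial and final states):
# frames = [Image.open(f"frame_{i:03d}.png") for i in range(16)]
# score = transition_completion(frames,
#                               "a green caterpillar on a leaf",
#                               "a butterfly resting on a leaf")
```

A higher score indicates that the video both starts near the initial state and ends near the final state; a video that never transitions, or that shows the final state from the first frame, scores low under this crude heuristic.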