TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Abstract

Video generation poses many challenges beyond those of image generation. The temporal dimension introduces a wide range of possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should, as real-world videos do, incorporate the emergence of new concepts and transitions in their relations as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, which reduces ambiguity in how frames should develop and simplifies the assessment of whether a transition has completed. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, and show that they correlate significantly better with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, leaving substantial room for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and to synthesize various components across different time steps.

