Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Abstract

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of generated videos. However, even recent T2V models struggle to follow text descriptions accurately, especially when the prompt requires precise control of spatial layouts or object trajectories. A recent line of research applies layout guidance to T2V models, which requires fine-tuning or iterative manipulation of the attention map at inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps. In the first two steps, Video-MSG creates a Video Sketch, a fine-grained spatio-temporal plan for the final video that specifies the background, foreground, and object trajectories in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with the Video Sketch through noise inversion and denoising. Notably, Video-MSG needs neither fine-tuning nor attention manipulation, both of which require additional memory at inference time, which makes it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies on the noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
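The last step described above, guiding a T2V diffusion model with the Video Sketch through noise inversion and denoising, can be illustrated with a minimal sketch. The abstract does not specify the inversion procedure, so the code below assumes an SDEdit-style forward noising of the sketch latents under a DDPM-like linear beta schedule. The function names, the schedule parameters, and the latent shape are all hypothetical and are not taken from the paper. The inversion ratio sets how far the sketch is pushed toward pure noise before a backbone would start denoising.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def structured_noise_init(sketch_latents, inversion_ratio, alpha_bar, rng):
    """Noise the Video Sketch latents up to an intermediate timestep.

    Assumption: "noise inversion" is approximated here by forward diffusion,
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    Returns the noisy latents and the timestep index at which a
    (hypothetical) T2V backbone would begin denoising.
    """
    if not 0.0 < inversion_ratio <= 1.0:
        raise ValueError("inversion_ratio must be in (0, 1]")
    num_steps = len(alpha_bar)
    # Map the ratio to a timestep index; ratio 1.0 means the noisiest step.
    t = max(int(round(inversion_ratio * num_steps)) - 1, 0)
    eps = rng.standard_normal(sketch_latents.shape)
    a = alpha_bar[t]
    noisy = np.sqrt(a) * sketch_latents + np.sqrt(1.0 - a) * eps
    return noisy, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    # Hypothetical latent shape: (frames, channels, height, width).
    sketch = rng.standard_normal((16, 4, 32, 32))
    noisy, t_start = structured_noise_init(sketch, 0.8, alpha_bar, rng)
    print("start denoising at step", t_start, "with latents of shape", noisy.shape)
```

Under these assumptions, a lower ratio keeps more of the layout and trajectories planned in the Video Sketch, while a higher ratio gives the backbone more freedom to repaint the draft frames. That trade-off is presumably what the ablation on the noise inversion ratio examines.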
