Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Abstract

Although diffusion models have shown a powerful ability to generate photorealistic images, generating realistic and diverse videos remains in its infancy. One key reason is that current methods intertwine spatial content and temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation. In this work, we propose HiGen, a diffusion-model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives: the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion and appearance changes, respectively. These two cues then guide the model's training for video generation, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over state-of-the-art T2V methods.
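To make the content-level idea concrete, the following is a minimal sketch of how two scalar cues, one for motion and one for appearance change, might be computed from a clip. It is not HiGen's implementation: the abstract does not specify how the cues are extracted, so the frame-differencing proxy, the intensity-histogram stand-in for visual features, and the names `motion_cue`, `appearance_cue`, `histogram`, and `cosine` are all assumptions chosen for illustration.

```python
from math import sqrt
from typing import List, Sequence

# Assumption: each frame is flattened to pixel intensities in [0, 1].
Frame = Sequence[float]


def motion_cue(frames: List[Frame]) -> float:
    """Mean absolute difference between consecutive frames (hypothetical motion proxy)."""
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
    return total / (len(frames) - 1)


def histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Coarse intensity histogram, a toy stand-in for a learned visual feature."""
    counts = [0.0] * bins
    for value in frame:
        index = max(0, min(int(value * bins), bins - 1))
        counts[index] += 1.0
    return counts


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def appearance_cue(frames: List[Frame]) -> float:
    """One minus the feature similarity of the first and last frames (hypothetical)."""
    if len(frames) < 2:
        return 0.0
    return 1.0 - cosine(histogram(frames[0]), histogram(frames[-1]))


if __name__ == "__main__":
    static_clip = [[0.2, 0.8, 0.5, 0.1]] * 3
    changing_clip = [[0.0, 0.0], [1.0, 1.0]]
    print(motion_cue(static_clip), appearance_cue(static_clip))
    print(motion_cue(changing_clip), appearance_cue(changing_clip))
```

In a training setup along the lines the abstract describes, values like these would be supplied to the denoiser as extra conditioning signals. How HiGen actually encodes and injects its cues is not stated here.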
