LayerFlow: A Unified Model for Layer-aware Video Generation

Abstract

We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, the clean background, and the blended scene. It also supports versatile variants, such as decomposing a blended video, or generating the background for a given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support the aforementioned variants in one unified framework. Because high-quality layer-wise training videos are scarce, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train a content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
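
The role of the two LoRAs at inference can be illustrated with a small sketch. It assumes the standard additive LoRA formulation, in which each adapter contributes a low-rank update `up @ down` to a base weight matrix; the abstract does not state how the adapters are applied inside the model, so the helper `effective_weight`, the adapter names, and the toy 2x2 matrices are hypothetical and serve only to show what "removing the motion LoRA" could mean numerically.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def effective_weight(
    base: Matrix,
    loras: Dict[str, Tuple[Matrix, Matrix]],
    active: Sequence[str],
) -> Matrix:
    """Return base + sum of (up @ down) over the active adapters.

    Assumption: adapters are additive low-rank updates (standard LoRA).
    `loras` maps a hypothetical adapter name to its (down, up) pair.
    """
    weight = [row[:] for row in base]  # copy so the base stays untouched
    for name in active:
        down, up = loras[name]
        weight = add(weight, matmul(up, down))
    return weight


# Toy base weight and two rank-1 adapters (values are illustrative only).
base = [[1.0, 0.0], [0.0, 1.0]]
loras = {
    "motion": ([[0.0, 0.5]], [[1.0], [0.0]]),   # (down 1x2, up 2x1)
    "content": ([[0.25, 0.0]], [[0.0], [1.0]]),
}

# Content-LoRA training stage: both adapters are applied (assumed here).
train_weight = effective_weight(base, loras, ["motion", "content"])

# Inference: the motion LoRA is dropped and only the content LoRA remains.
infer_weight = effective_weight(base, loras, ["content"])
```

Under these assumptions, dropping an adapter simply omits its low-rank term, so the base weights and the content update are left unchanged while the static-frame adaptation disappears.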