LayerFlow: A Unified Model for Layer-aware Video Generation
June 4, 2025
Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
cs.AI
Abstract
We present LayerFlow, a unified solution for layer-aware video generation.
Given per-layer prompts, LayerFlow generates videos for the transparent
foreground, clean background, and blended scene. It also supports versatile
variants like decomposing a blended video or generating the background for the
given foreground and vice versa. Starting from a text-to-video diffusion
transformer, we organize the videos for different layers as sub-clips, and
leverage layer embeddings to distinguish each clip and the corresponding
layer-wise prompts. In this way, we seamlessly support the aforementioned
variants in one unified framework. To address the lack of high-quality layer-wise
training videos, we design a multi-stage training strategy to accommodate
static images with high-quality layer annotations. Specifically, we first train
the model with low-quality video data. Then, we tune a motion LoRA to make the
model compatible with static frames. Afterward, we train a content LoRA on a
mixture of high-quality layered images and copy-pasted video data. During
inference, we remove the motion LoRA, thus generating smooth videos with the
desired layers.
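The sub-clip organization described above can be sketched as follows. This is a minimal NumPy illustration, assuming a learned per-layer embedding table that is added to each sub-clip's tokens before the clips are concatenated into one sequence for joint denoising; the function and variable names are hypothetical, not from the paper's code.

```python
import numpy as np

def add_layer_embeddings(clip_tokens, layer_table):
    """Tag each sub-clip's tokens with its layer embedding, then concatenate
    the sub-clips into a single sequence for the diffusion transformer.

    clip_tokens: list of (seq_len, dim) token arrays, one per layer sub-clip
                 (e.g. index 0 = foreground, 1 = background, 2 = blended scene)
    layer_table: (num_layers, dim) learned embedding table
    """
    # Broadcasting adds the same layer embedding to every token in the clip,
    # letting the model distinguish which layer each token belongs to.
    tagged = [tokens + layer_table[i] for i, tokens in enumerate(clip_tokens)]
    return np.concatenate(tagged, axis=0)
```

In practice the embedding table would be a learned parameter of the transformer, and the layer-wise prompts would be tagged analogously on the text side.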