LayerFlow: A Unified Model for Layer-aware Video Generation
June 4, 2025
Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
cs.AI
Abstract
We present LayerFlow, a unified solution for layer-aware video generation.
Given per-layer prompts, LayerFlow generates videos for the transparent
foreground, clean background, and blended scene. It also supports versatile
variants like decomposing a blended video or generating the background for the
given foreground and vice versa. Starting from a text-to-video diffusion
transformer, we organize the videos for different layers as sub-clips, and
leverage layer embeddings to distinguish each clip and the corresponding
layer-wise prompts. In this way, we seamlessly support the aforementioned
variants in one unified framework. To address the lack of high-quality layer-wise
training videos, we design a multi-stage training strategy to accommodate
static images with high-quality layer annotations. Specifically, we first train
the model with low-quality video data. Then, we tune a motion LoRA to make the
model compatible with static frames. Afterward, we train a content LoRA on a
mixture of high-quality layered images and copy-pasted video data. During
inference, we remove the motion LoRA, thus generating smooth videos with the
desired layers.
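The sub-clip organization described above can be sketched as follows. This is a minimal NumPy illustration, assuming a learned per-layer embedding table that is added to each sub-clip's tokens before the clips are concatenated into one sequence for joint denoising; the function and variable names are hypothetical, not from the paper's code.

```python
import numpy as np

def add_layer_embeddings(clip_tokens, layer_table):
    """Tag each sub-clip's tokens with its layer embedding, then concatenate
    the sub-clips into a single sequence for the diffusion transformer.

    clip_tokens: list of (seq_len, dim) token arrays, one per layer sub-clip
                 (e.g. index 0 = foreground, 1 = background, 2 = blended scene)
    layer_table: (num_layers, dim) learned embedding table
    """
    # Broadcasting adds the same layer embedding to every token in the clip,
    # letting the model distinguish which layer each token belongs to.
    tagged = [tokens + layer_table[i] for i, tokens in enumerate(clip_tokens)]
    return np.concatenate(tagged, axis=0)
```

In practice the embedding table would be a learned parameter of the transformer, and the layer-wise prompts would be tagged analogously on the text side.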