LayerFlow: A Unified Model for Layer-aware Video Generation
June 4, 2025
Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
cs.AI
Abstract
We present LayerFlow, a unified solution for layer-aware video generation.
Given per-layer prompts, LayerFlow generates videos for the transparent
foreground, clean background, and blended scene. It also supports versatile
variants like decomposing a blended video or generating the background for the
given foreground and vice versa. Starting from a text-to-video diffusion
transformer, we organize the videos for different layers as sub-clips, and
leverage layer embeddings to distinguish each clip and the corresponding
layer-wise prompts. In this way, we seamlessly support the aforementioned
variants in one unified framework. To address the lack of high-quality layer-wise
training videos, we design a multi-stage training strategy to accommodate
static images with high-quality layer annotations. Specifically, we first train
the model with low-quality video data. Then, we tune a motion LoRA to make the
model compatible with static frames. Afterward, we train the content LoRA on
the mixture of high-quality layered images and copy-pasted video data. During
inference, we remove the motion LoRA, thus generating smooth videos with the
desired layers.
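The layer-embedding conditioning described in the abstract can be pictured with a short PyTorch sketch. This is a minimal illustration under our own assumptions (the module name `LayerEmbedding`, three layers, and concatenation along the frame axis are ours), not the authors' released code: the latents of each per-layer sub-clip are tagged with a learned layer embedding and stacked into one sequence, so a single diffusion transformer can attend jointly across the foreground, background, and blended clips.

```python
import torch
import torch.nn as nn

class LayerEmbedding(nn.Module):
    """Tags each per-layer sub-clip with a learned layer embedding (illustrative sketch)."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One embedding per layer type, e.g. foreground / background / blended scene.
        self.embed = nn.Embedding(num_layers, dim)

    def forward(self, clip_latents: list) -> torch.Tensor:
        # clip_latents[i]: (batch, frames, tokens, dim) latents for layer i.
        tagged = []
        for layer_id, latents in enumerate(clip_latents):
            tag = self.embed(torch.tensor(layer_id, device=latents.device))
            tagged.append(latents + tag)  # broadcast the layer tag onto every token
        # Concatenate sub-clips along the frame axis so one transformer
        # jointly denoises the foreground, background, and blended videos.
        return torch.cat(tagged, dim=1)

# Example with three layers (transparent foreground, clean background, blended scene).
embed = LayerEmbedding(num_layers=3, dim=128)
clips = [torch.randn(1, 8, 256, 128) for _ in range(3)]
joint_tokens = embed(clips)  # shape: (1, 24, 256, 128)
```

In the paper's framework each sub-clip is additionally paired with its own layer-wise prompt; the sketch only shows how the layer identity could be injected into the shared token sequence.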