LayerFlow: A Unified Model for Layer-aware Video Generation
June 4, 2025
Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
cs.AI
Abstract
We present LayerFlow, a unified solution for layer-aware video generation.
Given per-layer prompts, LayerFlow generates videos for the transparent
foreground, clean background, and blended scene. It also supports versatile
variants like decomposing a blended video or generating the background for the
given foreground and vice versa. Starting from a text-to-video diffusion
transformer, we organize the videos for different layers as sub-clips, and
leverage layer embeddings to distinguish each clip and the corresponding
layer-wise prompts. In this way, we seamlessly support the aforementioned
variants in one unified framework. To address the lack of high-quality layer-wise
training videos, we design a multi-stage training strategy to accommodate
static images with high-quality layer annotations. Specifically, we first train
the model with low-quality video data. Then, we tune a motion LoRA to make the
model compatible with static frames. Afterward, we train the content LoRA on
the mixture of high-quality layered images and copy-pasted video data. During
inference, we remove the motion LoRA, thus generating smooth videos with the
desired layers.
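The layer-embedding conditioning described in the abstract can be pictured with a short PyTorch sketch. This is a minimal illustration under our own assumptions (the module name `LayerEmbedding`, three layers, and concatenation along the frame axis are ours), not the authors' released code: the latents of each per-layer sub-clip are tagged with a learned layer embedding and stacked into one sequence, so a single diffusion transformer can attend jointly across the foreground, background, and blended clips.

```python
import torch
import torch.nn as nn

class LayerEmbedding(nn.Module):
    """Tags each per-layer sub-clip with a learned layer embedding (illustrative sketch)."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One embedding per layer type, e.g. foreground / background / blended scene.
        self.embed = nn.Embedding(num_layers, dim)

    def forward(self, clip_latents: list) -> torch.Tensor:
        # clip_latents[i]: (batch, frames, tokens, dim) latents for layer i.
        tagged = []
        for layer_id, latents in enumerate(clip_latents):
            tag = self.embed(torch.tensor(layer_id, device=latents.device))
            tagged.append(latents + tag)  # broadcast the layer tag onto every token
        # Concatenate sub-clips along the frame axis so one transformer
        # jointly denoises the foreground, background, and blended videos.
        return torch.cat(tagged, dim=1)

# Example with three layers (transparent foreground, clean background, blended scene).
embed = LayerEmbedding(num_layers=3, dim=128)
clips = [torch.randn(1, 8, 256, 128) for _ in range(3)]
joint_tokens = embed(clips)  # shape: (1, 24, 256, 128)
```

In the paper's framework each sub-clip is additionally paired with its own layer-wise prompt; the sketch only shows how the layer identity could be injected into the shared token sequence.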