LayerFlow: A Unified Model for Layer-aware Video Generation

Abstract

We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, the clean background, and the blended scene. It also supports versatile variants, such as decomposing a blended video, or generating the background for a given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support the aforementioned variants in one unified framework. To address the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train a content LoRA on a mixture of image data, consisting of high-quality layered images, and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
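As a rough illustration of two ideas in the abstract, the toy sketch below (1) concatenates per-layer sub-clips into one sequence and tags each frame with a layer embedding, and (2) shows how a motion LoRA can be dropped at inference while a content LoRA is kept. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the sub-clip order, adding the embedding to the frame features, the rank-1 adapters, and the scale values. Reading the final training stage as using both adapters together is also an interpretation of the abstract, not a confirmed detail.

```python
# Toy sketch: names, shapes, and the additive embedding are illustrative assumptions.
LAYERS = ("foreground", "background", "blended")  # assumed sub-clip order


def build_layered_sequence(clips, layer_embeddings):
    """Concatenate per-layer sub-clips into one sequence of frame features.

    clips: dict layer name -> list of frames, each frame a list of floats.
    layer_embeddings: dict layer name -> list of floats (same width as a frame).
    Returns (sequence, layer_ids), where layer_ids[i] is the layer index of frame i.
    """
    sequence, layer_ids = [], []
    for layer_id, name in enumerate(LAYERS):
        emb = layer_embeddings[name]
        for frame in clips[name]:
            # Tag every frame of this sub-clip with its layer embedding.
            sequence.append([x + e for x, e in zip(frame, emb)])
            layer_ids.append(layer_id)
    return sequence, layer_ids


def matmul(a, b):
    """Plain-Python matrix product of two 2D lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(base, adapters, enabled):
    """Return base + sum of (scale * B @ A) over the enabled LoRA adapters."""
    weight = [row[:] for row in base]  # copy so the base weight is not mutated
    for name in enabled:
        b, a, scale = adapters[name]
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                weight[i][j] += scale * value
    return weight


clips = {
    "foreground": [[1.0, 0.0], [1.0, 0.0]],
    "background": [[0.0, 1.0], [0.0, 1.0]],
    "blended": [[1.0, 1.0], [1.0, 1.0]],
}
layer_embeddings = {
    "foreground": [0.1, 0.1],
    "background": [0.2, 0.2],
    "blended": [0.3, 0.3],
}
sequence, layer_ids = build_layered_sequence(clips, layer_embeddings)

base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "motion": ([[1.0], [0.0]], [[0.0, 1.0]], 0.5),   # hypothetical rank-1 (B, A, scale)
    "content": ([[0.0], [1.0]], [[1.0, 0.0]], 0.5),
}
# Assumed reading: the last training stage uses both adapters,
# while inference keeps only the content LoRA.
train_weight = effective_weight(base, adapters, ["motion", "content"])
infer_weight = effective_weight(base, adapters, ["content"])
```

Because each adapter is kept as a separate low-rank update on top of the frozen base weight, removing the motion LoRA amounts to leaving its term out of the sum; the base weight and the content LoRA are unaffected.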
