GenCompositor: Generative Video Compositing with Diffusion Transformer
September 2, 2025
Authors: Shuzhou Yang, Xiaoyu Li, Xiaodong Cun, Guangzhi Wang, Lingen Li, Ying Shan, Jian Zhang
cs.AI
Abstract
Video compositing combines live-action footage to create video works, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive manual effort and expert collaboration, resulting in lengthy production cycles and high labor costs. To address this issue, we automate this process with generative models, a task we call generative video compositing. This new task strives to adaptively inject the identity and motion information of a foreground video into a target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we design a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To maintain consistency of the target video before and after editing, we revise a lightweight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, we propose a DiT fusion block that uses full self-attention, along with a simple yet effective foreground augmentation strategy for training. In addition, to fuse background and foreground videos with different layouts under user control, we develop a novel position embedding named Extended Rotary Position Embedding (ERoPE). Finally, we curate a dataset of 61K video sets for this new task, named VideoComp, which includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing candidate solutions in fidelity and consistency.
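
The background preservation branch is only described at a high level in the abstract. As a rough illustration, the sketch below shows one way masked token injection could work: tokens from the masked target video are projected by a lightweight layer and added back into the main DiT hidden states at the corresponding positions. The class and parameter names (`MaskedTokenInjection`, `patch_proj`, the zero-initialized `scale`) and the additive injection itself are assumptions for illustration, not the paper's verified implementation.

```python
import torch
import torch.nn as nn


class MaskedTokenInjection(nn.Module):
    """Hypothetical sketch of a lightweight background-preservation branch.

    The region to be edited is masked out of the target video, the remaining
    background is patchified into tokens by a small projection, and those tokens
    are injected (added) into the main DiT hidden states at matching token
    positions, so the unedited background stays consistent after compositing.
    """

    def __init__(self, patch_dim: int, hidden_dim: int):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, hidden_dim)  # lightweight tokenizer
        self.scale = nn.Parameter(torch.zeros(1))            # zero-init: injection starts as identity

    def forward(self, hidden: torch.Tensor, bg_patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden:     (B, L, hidden_dim) main-branch DiT tokens
        # bg_patches: (B, L, patch_dim)  patchified target video
        # mask:       (B, L) 1 where the token belongs to the preserved background
        bg_tokens = self.patch_proj(bg_patches) * mask.unsqueeze(-1)
        return hidden + self.scale * bg_tokens
```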
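
Likewise, the DiT fusion block and ERoPE are each summarized in a single sentence. The following minimal sketch illustrates one plausible reading: full self-attention over concatenated background and foreground tokens, where foreground tokens receive rotary positions offset past the background range so the two streams never share positional indices, even when their layouts differ. The class and helper names, the 1D position scheme, and the offset strategy are all hypothetical, not the paper's actual ERoPE formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_freqs(head_dim: int, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Complex rotary factors e^{i * pos * theta_k} for 1D positions."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=positions.device, dtype=torch.float32) / head_dim))
    angles = positions.to(torch.float32)[:, None] * theta[None, :]   # (L, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)              # complex (L, head_dim/2)


def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate features by complex factors; freqs must broadcast against x's complex view."""
    xc = torch.view_as_complex(x.to(torch.float32).reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(xc * freqs).flatten(-2).type_as(x)


class FusionSelfAttention(nn.Module):
    """Sketch of a fusion block: full self-attention over concatenated bg + fg tokens."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, bg_tokens: torch.Tensor, fg_tokens: torch.Tensor) -> torch.Tensor:
        B, Lb, _ = bg_tokens.shape
        Lf = fg_tokens.shape[1]
        x = torch.cat([bg_tokens, fg_tokens], dim=1)                  # (B, Lb+Lf, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, Lb + Lf, self.num_heads, self.head_dim)
        k = k.view(B, Lb + Lf, self.num_heads, self.head_dim)
        v = v.view(B, Lb + Lf, self.num_heads, self.head_dim)

        # ERoPE-style extension (assumed): background keeps positions [0, Lb),
        # foreground is shifted to [Lb, Lb+Lf) instead of reusing background indices.
        pos = torch.arange(Lb + Lf, device=x.device)
        freqs = rope_freqs(self.head_dim, pos)                        # (Lb+Lf, head_dim/2)
        q = apply_rope(q, freqs[None, :, None, :])
        k = apply_rope(k, freqs[None, :, None, :])

        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))  # (B, H, L, D)
        out = out.transpose(1, 2).reshape(B, Lb + Lf, -1)
        return self.proj(out)
```

The offset positions are the key point of this reading: because foreground tokens never collide with background positions, a foreground video of a different resolution or duration can be attended to jointly with the target video without remapping either layout.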