GenCompositor: Generative Video Compositing with Diffusion Transformer

September 2, 2025
Authors: Shuzhou Yang, Xiaoyu Li, Xiaodong Cun, Guangzhi Wang, Lingen Li, Ying Shan, Jian Zhang
cs.AI

Abstract

Video compositing combines live-action footage to create video productions, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive labor and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate the process with generative models, a new task we call generative video compositing. The task strives to adaptively inject the identity and motion information of a foreground video into a target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we design a novel Diffusion Transformer (DiT) pipeline based on the model's intrinsic properties. To keep the target video consistent before and after editing, we devise a lightweight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, we propose a DiT fusion block that uses full self-attention, along with a simple yet effective foreground augmentation strategy for training. In addition, to fuse background and foreground videos with different layouts under user control, we develop a novel position embedding named Extended Rotary Position Embedding (ERoPE). Finally, we curate VideoComp, a dataset of 61K video sets for this new task, comprising complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing candidate solutions in fidelity and consistency.
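The abstract only names the fusion components, so the sketch below is an illustrative guess at the mechanics, not the paper's implementation: it runs full self-attention over concatenated background and foreground tokens, and mimics an "extended" rotary position embedding by offsetting foreground positions past the background sequence so the two layouts never share rotary phases. All function names, the single-head setup, and the offset scheme are hypothetical assumptions; attention projections are omitted for brevity.

```python
# Minimal sketch of full self-attention over fused video tokens with an
# assumed "extended" RoPE scheme. NOT the paper's ERoPE; the offset idea
# is a hypothesis about how disjoint layouts could be kept addressable.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard RoPE angles: one frequency per pair of channels."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None].float() * freqs[None, :]  # (seq, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor):
    """Rotate channel pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def fused_self_attention(bg_tokens, fg_tokens, dim_head):
    """Full self-attention over [background; foreground] tokens.

    bg_tokens, fg_tokens: (seq, dim) single-head tensors for brevity.
    Foreground positions start after the last background index, so the
    concatenated sequence has non-overlapping rotary positions (the
    assumed "extension") instead of reusing indices 0..n_fg-1.
    Query/key projections are omitted; this only shows position handling.
    """
    n_bg, n_fg = bg_tokens.shape[0], fg_tokens.shape[0]
    pos = torch.cat([torch.arange(n_bg), n_bg + torch.arange(n_fg)])
    x = torch.cat([bg_tokens, fg_tokens], dim=0)
    q = apply_rope(x, rope_angles(pos, dim_head))
    k = apply_rope(x, rope_angles(pos, dim_head))
    attn = torch.softmax(q @ k.T / dim_head ** 0.5, dim=-1)
    return attn @ x  # fused tokens, one per input token

# Toy usage: 8 background tokens and 4 foreground tokens, 16 channels.
bg = torch.randn(8, 16)
fg = torch.randn(4, 16)
fused = fused_self_attention(bg, fg, dim_head=16)
print(fused.shape)  # torch.Size([12, 16])
```

In this toy setting, every token can attend to every other token while the two sources occupy distinct rotary positions, which matches the property the abstract attributes to ERoPE: fusing videos with different layouts without forcing them onto the same positional grid.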