GenCompositor: Generative Video Compositing with Diffusion Transformer

Abstract

Video compositing combines live-action footage into a finished video and is a crucial technique in video creation and film production. Traditional pipelines require intensive labor and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate the process with generative models, a task we call generative video compositing. This new task aims to adaptively inject the identity and motion information of a foreground video into a target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we design a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To keep the target video consistent before and after editing, we revise a lightweight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, we propose a DiT fusion block that uses full self-attention, together with a simple yet effective foreground augmentation for training. In addition, to fuse background and foreground videos with different layouts under user control, we develop a novel position embedding named Extended Rotary Position Embedding (ERoPE). Finally, we curate a dataset for the new task, called VideoComp, comprising 61K sets of videos. The data includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing and outperforms existing possible solutions in fidelity and consistency.
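As a rough illustration of what "full self-attention" over two video streams can mean, the sketch below concatenates background and foreground tokens along the token axis and lets every token attend to every other token. The abstract does not describe the internals of the DiT fusion block, so everything here is an assumption made for illustration: the function names (`full_self_attention`, `fuse`), the single attention head, the identity query/key/value projections, and the choice to return only the background-length slice. Position embeddings such as ERoPE are deliberately left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_self_attention(tokens):
    """Single-head self-attention with identity projections.

    Hypothetical simplification: a real block would use learned
    query/key/value projections and multiple heads.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in tokens:
        # Every token scores against every token (no masking).
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def fuse(background_tokens, foreground_tokens):
    """Attend jointly over both streams, then keep the background slice.

    Returning only the background-length slice is an assumption of this
    sketch, not something stated in the abstract.
    """
    joint = background_tokens + foreground_tokens
    attended = full_self_attention(joint)
    return attended[: len(background_tokens)]


background = [[1.0, 0.0], [0.0, 1.0]]
foreground = [[1.0, 1.0]]
fused = fuse(background, foreground)
```

Because the attention is unmasked, each fused background token is a weighted average over both background and foreground tokens, which is one simple way foreground content can flow into the target video representation.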
