GenCompositor: Generative Video Compositing with Diffusion Transformer

Abstract

Video compositing combines live-action footage into a finished video and is a crucial technique in video creation and film production. Traditional pipelines require intensive labor and collaboration among experts, resulting in long production cycles and high manpower costs. To address this issue, we automate the process with generative models and call the resulting task generative video compositing. This new task aims to adaptively inject the identity and motion information of a foreground video into a target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we design a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To keep the target video consistent before and after editing, we revise a lightweight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, we propose a DiT fusion block that uses full self-attention, along with a simple yet effective foreground augmentation for training. In addition, to fuse background and foreground videos with different layouts under user control, we develop a novel position embedding named Extended Rotary Position Embedding (ERoPE). Finally, we curate VideoComp, a dataset of 61K sets of videos for the new task. The dataset includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
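To make the fusion idea more concrete, the following is a minimal sketch of full self-attention over a joint token sequence. It is not the authors' implementation: the abstract does not say how the fusion block builds its input, so concatenating background and foreground tokens is an assumption, as are the hypothetical names `full_self_attention` and `fuse`, the single attention head, and the identity query/key/value projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_self_attention(tokens):
    """Single-head self-attention in which every token attends to every token.

    Assumption: identity Q/K/V projections, used only to keep the sketch short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Scaled dot-product score of this query against all tokens.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors (here the tokens themselves).
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def fuse(background_tokens, foreground_tokens):
    """Hypothetical fusion step: attend over the concatenated sequence,
    then keep only the positions that belong to the background video."""
    joint = background_tokens + foreground_tokens
    attended = full_self_attention(joint)
    return attended[: len(background_tokens)]


# Toy example: two background tokens and one foreground token, each 2-dimensional.
background = [[1.0, 0.0], [0.0, 1.0]]
foreground = [[1.0, 1.0]]
fused = fuse(background, foreground)
```

Because attention runs over the joint sequence, each background token can draw information from the foreground tokens, which is the property the fusion block relies on. A real DiT block would add learned projections, multiple heads, normalization, and position embeddings such as ERoPE, none of which are modeled here.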

