Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Abstract

Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer along with an alpha matte for compositing with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, with separate DiTs for each layer that interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Across our quantitative benchmark and human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.
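
The compositing step described above can be illustrated with a minimal sketch. It assumes the standard "over" blend with a non-premultiplied edit layer, which the abstract does not specify; the function name `composite`, the array shapes, and the toy values are hypothetical and are not taken from Vera's implementation.

```python
import numpy as np


def composite(source, edit_layer, alpha):
    """Blend an edit layer over a source video with a per-pixel alpha matte.

    Assumed (hypothetical) conventions:
      source, edit_layer: float arrays of shape (T, H, W, 3), values in [0, 1]
      alpha:              float array of shape (T, H, W, 1), values in [0, 1]
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    # Where alpha is 0 the source pixel passes through unchanged,
    # which is what makes content preservation hold by construction.
    return alpha * edit_layer + (1.0 - alpha) * source


# Toy example: a 2-frame, 4x4 video edited only in the central 2x2 patch.
T, H, W = 2, 4, 4
source = np.full((T, H, W, 3), 0.2)
edit_layer = np.full((T, H, W, 3), 0.9)
alpha = np.zeros((T, H, W, 1))
alpha[:, 1:3, 1:3, :] = 1.0

out = composite(source, edit_layer, alpha)
```

In this sketch, every pixel outside the matte is copied from the source exactly, so any change to preserved content can only come from a nonzero alpha value rather than from regeneration.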