Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Abstract

Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel of every frame and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer together with its alpha matte, which are composited with the source video, so that creative editing is separated from content preservation by design. To encourage coherent composition with the source video, we extend a text-to-video DiT into a Mixture-of-Transformers (MoT) architecture in which each layer has its own DiT and the layers interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. On our quantitative benchmark and in our human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using only 486K frames of layered training data.
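To make the layered formulation concrete, the following is a minimal sketch of how an edit layer and its alpha matte could be composited with the source video. The abstract does not give the compositing equation, so the standard "over" operator with straight (non-premultiplied) alpha is assumed here; the function name `composite`, the array shapes, and the toy data are all hypothetical and are not taken from Vera's implementation.

```python
import numpy as np


def composite(source, edit_layer, alpha):
    """Alpha-composite an edit layer over source frames (assumed "over" operator).

    source, edit_layer: float arrays of shape (T, H, W, 3) with values in [0, 1]
    alpha:              float array of shape (T, H, W, 1) with values in [0, 1]
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    # Where alpha is 0 the source pixel passes through unchanged;
    # where alpha is 1 the pixel is taken from the edit layer.
    return alpha * edit_layer + (1.0 - alpha) * source


# Toy data: 2 frames of 4x4 RGB video (hypothetical, for illustration only).
rng = np.random.default_rng(0)
source = rng.random((2, 4, 4, 3))
edit_layer = rng.random((2, 4, 4, 3))

# The matte marks a 2x2 region in the centre of every frame as edited.
alpha = np.zeros((2, 4, 4, 1))
alpha[:, 1:3, 1:3, :] = 1.0

out = composite(source, edit_layer, alpha)
```

Under this assumption, every pixel outside the matte is copied from the source rather than regenerated, which is the sense in which preservation holds by design: the model can only change the video where it predicts nonzero alpha.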