Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Abstract

Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer together with an alpha matte and composites them with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, in which each layer has its own DiT and the DiTs interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Trained on 486K frames of layered data, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, on both our quantitative benchmark and a human preference study.
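
The compositing step described above can be illustrated with a minimal sketch. The abstract does not state the compositing equation, so this assumes the standard "over" operator with a non-premultiplied alpha matte; the function name `composite`, the array shapes, and the toy data are hypothetical and are not taken from Vera itself.

```python
import numpy as np


def composite(source, edit_layer, alpha):
    """Alpha-composite an edit layer over a source video.

    Assumed (hypothetical) conventions:
      source, edit_layer: float arrays of shape (T, H, W, 3), values in [0, 1]
      alpha:              float array of shape (T, H, W, 1), values in [0, 1]
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    # Standard "over" operator: where alpha is 0 the source pixel is kept
    # untouched, where alpha is 1 the edit layer replaces it.
    return alpha * edit_layer + (1.0 - alpha) * source


# Toy data: 2 frames of 4x4 RGB video.
rng = np.random.default_rng(0)
source = rng.random((2, 4, 4, 3))
edit_layer = rng.random((2, 4, 4, 3))

# The matte selects only the central 2x2 region for editing.
alpha = np.zeros((2, 4, 4, 1))
alpha[:, 1:3, 1:3, :] = 1.0

output = composite(source, edit_layer, alpha)
```

Under these assumptions, every pixel where the matte is zero is copied from the source, which is the sense in which preservation holds by construction rather than being learned.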