NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Abstract

Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and is a critical bottleneck, especially for local video editing. Existing pair-free workarounds transfer image editing to video through global motion control, but such designs struggle to keep the background and the temporal behavior consistent. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
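
As a rough illustration of how a degradation-simulation setup can avoid paired data, the sketch below builds one training sample from a single unpaired clip: clean keyframes stand in for the sparse-branch input, a degraded copy of the clip stands in for the dense-branch input, and the clean clip is the reconstruction target. This is one reading of the abstract, not the authors' implementation. The function names, the evenly spaced keyframe selection, the Gaussian-noise degradation, and the toy frame format (nested lists of values in [0, 1]) are all assumptions made for illustration.

```python
import random


def pick_keyframes(num_frames, num_keys):
    """Return evenly spaced keyframe indices (assumed selection rule)."""
    if num_frames <= 0 or num_keys <= 0:
        raise ValueError("num_frames and num_keys must be positive")
    num_keys = min(num_keys, num_frames)
    if num_keys == 1:
        return [0]
    step = (num_frames - 1) / (num_keys - 1)
    return sorted({round(i * step) for i in range(num_keys)})


def degrade_frame(frame, noise_std, rng):
    """Add Gaussian noise and clamp to [0, 1] (a stand-in degradation)."""
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, noise_std))) for pixel in row]
        for row in frame
    ]


def make_training_sample(video, num_keys=3, noise_std=0.2, seed=0):
    """Build sparse input, dense input, and target from one unpaired clip."""
    rng = random.Random(seed)
    key_indices = pick_keyframes(len(video), num_keys)
    return {
        # Sparse branch (assumed): a few clean keyframes carry the semantics.
        "keyframes": {i: video[i] for i in key_indices},
        # Dense branch (assumed): every frame, artificially degraded.
        "dense_input": [degrade_frame(f, noise_std, rng) for f in video],
        # Target: the original clean clip, so no edited counterpart is needed.
        "target": video,
    }


if __name__ == "__main__":
    # Toy clip: 9 frames of 2x2 grayscale pixels.
    clip = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(9)]
    sample = make_training_sample(clip)
    print(sorted(sample["keyframes"]), len(sample["dense_input"]))
```

At inference time, under the same reading, the keyframes would be replaced by user-edited frames and the dense input by the original video; the abstract does not spell out those details.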
