

NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

March 3, 2026
Authors: Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si
cs.AI

Abstract

Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
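The degradation-simulation strategy described above can be illustrated with a minimal toy sketch. This is not the authors' code: the specific degradation operators (downsampling and Gaussian noise) and the `keep_every` keyframe spacing are assumptions chosen only to show the general idea of leaving sparse clean keyframes as semantic anchors while corrupting the remaining frames, so that (degraded, original) pairs can supervise reconstruction without collected paired data.

```python
import numpy as np

def degrade_video(frames, noise_std=0.1, keep_every=4, seed=0):
    """Toy degradation simulation: keep every `keep_every`-th frame clean
    (sparse keyframe anchors) and corrupt the rest with 2x down/upsampling
    plus Gaussian noise. `frames` has shape (T, H, W, C) with values in [0, 1].
    """
    rng = np.random.default_rng(seed)
    out = frames.copy()
    for t in range(frames.shape[0]):
        if t % keep_every == 0:
            continue  # sparse keyframes are left untouched
        small = frames[t][::2, ::2]                        # 2x downsample
        up = np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)  # nearest upsample
        out[t] = np.clip(up + rng.normal(0.0, noise_std, up.shape), 0.0, 1.0)
    return out

# A self-supervised training pair: input = degraded video, target = original.
video = np.random.default_rng(1).random((8, 16, 16, 3))
degraded = degrade_video(video)
```

In this sketch, a reconstruction model would be trained to map `degraded` back to `video`, which is how artificially degraded inputs can stand in for collected source/edited pairs.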