NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Abstract

Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and is a critical bottleneck, especially for local video editing. Existing workarounds achieve pair-free video editing by transferring image edits to video through global motion control, but such designs struggle to maintain background consistency and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
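To make the degradation-simulation idea concrete, the following is a minimal sketch of how one training sample might be assembled from a single unpaired clip. Everything specific here is an assumption for illustration and is not taken from the paper: the degradation (block averaging plus Gaussian noise), the evenly spaced keyframe selection, the use of clean frames as stand-ins for user-edited keyframes, and all names such as `degrade_frame` and `make_training_sample` are hypothetical.

```python
import random
from typing import Dict, List

Frame = List[List[float]]  # H x W grayscale frame with values in [0, 1]


def degrade_frame(frame: Frame, factor: int, noise_std: float,
                  rng: random.Random) -> Frame:
    """Hypothetical degradation: average each factor x factor block, add noise."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            ys = range(by, min(by + factor, h))
            xs = range(bx, min(bx + factor, w))
            mean = sum(frame[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    noisy = mean + rng.gauss(0.0, noise_std)
                    out[y][x] = min(1.0, max(0.0, noisy))  # clamp to [0, 1]
    return out


def keyframe_indices(num_frames: int, num_keyframes: int) -> List[int]:
    """Spread keyframe indices evenly over the clip (assumed selection rule)."""
    if num_frames <= 0 or num_keyframes <= 0:
        raise ValueError("num_frames and num_keyframes must be positive")
    k = min(num_keyframes, num_frames)
    if k == 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return sorted({round(i * step) for i in range(k)})


def make_training_sample(video: List[Frame], num_keyframes: int = 3,
                         factor: int = 2, noise_std: float = 0.05,
                         seed: int = 0) -> Dict[str, object]:
    """Build (dense input, sparse keyframes, target) from one unpaired clip."""
    rng = random.Random(seed)
    idx = keyframe_indices(len(video), num_keyframes)
    return {
        # Degraded copy of the clip: assumed input to the dense branch.
        "dense_input": [degrade_frame(f, factor, noise_std, rng) for f in video],
        # Clean frames at sparse positions: assumed input to the sparse branch.
        "sparse_keyframes": {i: video[i] for i in idx},
        # The original clip is the reconstruction target, so no pair is needed.
        "target": video,
    }
```

Under these assumptions the clean clip serves as its own supervision signal: the model would see a degraded version plus a few clean keyframes and be trained to reconstruct the original, which is one way the need for paired edited videos could be avoided.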
