FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Abstract

Text-driven video editing aims to modify video content according to natural language instructions. Recent training-free approaches have made progress by leveraging pre-trained diffusion models, but they typically rely on inversion-based techniques that map the input video into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space: an Ordinary Differential Equation (ODE) guides the video to transition smoothly along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and strengthen semantic alignment with the editing instruction, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance (CFG). This strategy leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
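
The three ingredients named in the abstract (an ODE evolved directly in data space, a mask that modulates the velocity field, and CFG-style guidance built from the difference between candidate flows) can be illustrated with a toy sketch. This is not the FlowDirector implementation: the "video" is a flat list of floats, the flows are hand-written stand-ins for a pre-trained model, and every name (`guided_velocity`, `edit_by_masked_flow`, `make_toy_flow`), the forward Euler solver, and the exact guidance formula are assumptions made only for illustration.

```python
from typing import Callable, List

Vector = List[float]
Flow = Callable[[Vector, float], Vector]


def guided_velocity(v_base: Vector, v_cand: Vector, scale: float) -> Vector:
    """CFG-style extrapolation (assumed form): base + scale * (candidate - base)."""
    return [b + scale * (c - b) for b, c in zip(v_base, v_cand)]


def edit_by_masked_flow(
    x0: Vector,
    mask: Vector,
    base_flow: Flow,
    cand_flow: Flow,
    scale: float = 1.5,
    steps: int = 50,
) -> Vector:
    """Integrate dx/dt = mask * v_guided(x, t) from t=0 to t=1 with forward Euler.

    The state x stays in data space throughout; no inversion to noise is performed.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = guided_velocity(base_flow(x, t), cand_flow(x, t), scale)
        # Elements with mask == 0 receive zero velocity and are left untouched.
        x = [xi + dt * mi * vi for xi, mi, vi in zip(x, mask, v)]
    return x


def make_toy_flow(target: float, gain: float) -> Flow:
    """Hypothetical stand-in for a model velocity: pulls each element toward `target`."""
    def flow(x: Vector, t: float) -> Vector:
        return [gain * (target - xi) for xi in x]
    return flow


source = [0.0, 0.0, 0.0, 0.0]
mask = [0.0, 1.0, 1.0, 0.0]  # edit only the two middle elements
weak, strong = make_toy_flow(1.0, 0.5), make_toy_flow(1.0, 1.0)

edited = edit_by_masked_flow(source, mask, weak, strong, scale=1.5)
unguided = edit_by_masked_flow(source, mask, weak, strong, scale=0.0)
```

In this toy setting, the masked-out elements of `edited` remain exactly equal to the source, while the guided run moves the editable elements further toward the target than the unguided run (`scale=0.0`). That mirrors, in a very simplified form, the background preservation and stronger semantic alignment described above.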