FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
June 5, 2025
Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang
cs.AI
Abstract
Text-driven video editing aims to modify video content according to natural
language instructions. While recent training-free approaches have made progress
by leveraging pre-trained diffusion models, they typically rely on
inversion-based techniques that map input videos into the latent space, which
often leads to temporal inconsistencies and degraded structural fidelity. To
address this, we propose FlowDirector, a novel inversion-free video editing
framework. Our framework models the editing process as a direct evolution in
data space, guiding the video via an Ordinary Differential Equation (ODE) to
smoothly transition along its inherent spatiotemporal manifold, thereby
preserving temporal coherence and structural details. To achieve localized and
controllable edits, we introduce an attention-guided masking mechanism that
modulates the ODE velocity field, preserving non-target regions both spatially
and temporally. Furthermore, to address incomplete edits and enhance semantic
alignment with editing instructions, we present a guidance-enhanced editing
strategy inspired by Classifier-Free Guidance, which leverages differential
signals between multiple candidate flows to steer the editing trajectory toward
stronger semantic alignment without compromising structural consistency.
Extensive experiments across benchmarks demonstrate that FlowDirector achieves
state-of-the-art performance in instruction adherence, temporal consistency,
and background preservation, establishing a new paradigm for efficient and
coherent video editing without inversion.
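
The abstract gives no equations, so the following is only a toy sketch of the general idea it describes: an Euler integration of an ODE in data space, where the update is a scaled difference between two candidate velocities and a mask zeroes the update outside the edited region. Everything here is an assumption made for illustration: `toy_velocity` is a hypothetical stand-in for a pretrained flow model, `masked_guided_edit`, `src_target`, `edit_target`, and `guidance` are invented names, and the mask is passed in by hand rather than derived from attention maps. It should not be read as the authors' algorithm.

```python
import numpy as np


def toy_velocity(x, target):
    """Hypothetical stand-in for a flow model's velocity prediction.

    A real model would be conditioned on a text prompt; here the "prompt"
    is simply a target array that the velocity points toward.
    """
    return target - x


def masked_guided_edit(video, src_target, edit_target, mask,
                       steps=50, guidance=1.0):
    """Euler-integrate a masked, guidance-scaled velocity difference.

    video, src_target, edit_target, mask: arrays of identical shape
    (frames, height, width). mask is 1 where edits are allowed, 0 elsewhere.
    """
    x = video.astype(float).copy()
    dt = 1.0 / steps
    for _ in range(steps):
        v_src = toy_velocity(x, src_target)    # velocity under the source description
        v_edit = toy_velocity(x, edit_target)  # velocity under the edit instruction
        delta = v_edit - v_src                 # differential signal between the two flows
        # The mask modulates the velocity field, so non-target pixels never move.
        x = x + dt * mask * guidance * delta
    return x


if __name__ == "__main__":
    video = np.zeros((4, 8, 8))        # 4 frames of 8x8 "pixels"
    edit_target = np.ones_like(video)  # what the edit should look like
    mask = np.zeros_like(video)
    mask[:, 2:5, 2:5] = 1.0            # editable region, same in every frame

    edited = masked_guided_edit(video, video, edit_target, mask, steps=20)
    print("background unchanged:", np.array_equal(edited[mask == 0], video[mask == 0]))
    print("mean inside mask:", edited[mask == 1].mean())
```

In this toy setting the background is untouched because its velocity is multiplied by zero at every step, and raising `guidance` above 1 pushes the masked region further along the edit direction, loosely mirroring how Classifier-Free Guidance amplifies a conditional signal.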