FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

June 5, 2025
Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang
cs.AI

Abstract

Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
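The abstract describes three ingredients: an ODE that evolves the video directly in data space, a mask that modulates the velocity field, and a guidance signal built from differences between candidate flows. The following is only a toy sketch of how such pieces might fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `toy_velocity` and `edit`, the use of a plain Euler integrator, the fixed hand-written mask (the paper derives its mask from attention), and the single `guidance` scale applied to the difference between a target-prompt flow and a source-prompt flow.

```python
from typing import Callable, List, Sequence

# Hypothetical signature for a velocity model: velocity(x, t, prompt) -> vector.
Velocity = Callable[[Sequence[float], float, str], List[float]]


def toy_velocity(x: Sequence[float], t: float, prompt: str) -> List[float]:
    """Hypothetical stand-in for a pretrained flow model.

    It simply pulls every element toward a prompt-specific constant.
    """
    target = {"cat": 0.0, "dog": 1.0}[prompt]
    return [target - xi for xi in x]


def edit(
    x_src: Sequence[float],
    mask: Sequence[float],
    velocity: Velocity,
    src_prompt: str,
    tgt_prompt: str,
    steps: int = 50,
    guidance: float = 1.0,
) -> List[float]:
    """Euler-integrate dx/dt = mask * guidance * (v_tgt - v_src) from t=0 to t=1.

    x_src holds flattened "video" values, and mask holds per-element weights
    in [0, 1], where 0 means "leave this element untouched".
    """
    x = list(x_src)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_src = velocity(x, t, src_prompt)
        v_tgt = velocity(x, t, tgt_prompt)
        for i in range(len(x)):
            # Editing direction: difference between the two candidate flows,
            # scaled by an assumed CFG-style guidance factor.
            dv = guidance * (v_tgt[i] - v_src[i])
            # The mask gates the velocity, so elements with mask 0 never move.
            x[i] += dt * mask[i] * dv
    return x


if __name__ == "__main__":
    source = [0.0, 0.0, 0.0, 0.0]
    mask = [0.0, 0.0, 1.0, 1.0]  # edit only the last two elements
    print(edit(source, mask, toy_velocity, "cat", "dog"))
```

No inversion to a noise latent appears anywhere in this sketch: the state starts at the source values and is moved directly, which is the property the abstract emphasizes. In this toy setup the masked-out elements stay exactly at their source values, while the masked-in elements drift from the "cat" constant toward the "dog" constant.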