FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

June 5, 2025
Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang
cs.AI

Abstract

Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
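
The abstract does not state the update rule, so the following is only a toy sketch of the general idea it describes: an ODE in data space whose velocity is gated by a mask and extrapolated between candidate flows in the style of Classifier-Free Guidance. It is not the authors' implementation. The function name `guided_masked_edit`, the callables `base_flow` and `strong_flow`, the guidance weight `scale`, and the choice of a plain Euler integrator are all assumptions made for illustration.

```python
import numpy as np

def guided_masked_edit(video, base_flow, strong_flow, mask, steps=10, scale=2.0):
    """Toy Euler integration of dx/dt = mask * guided_velocity over t in [0, 1].

    video       : array of shape (frames, height, width), edited in data space
    base_flow   : hypothetical callable (x, t) -> velocity, one candidate flow
    strong_flow : hypothetical callable (x, t) -> velocity, a second candidate flow
    mask        : array broadcastable to video; 1 = editable, 0 = preserved
    scale       : assumed guidance weight for the CFG-style extrapolation
    """
    x = video.astype(np.float64).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_base = base_flow(x, t)
        v_strong = strong_flow(x, t)
        # CFG-style extrapolation: push along the difference between the flows.
        v = v_base + scale * (v_strong - v_base)
        # The mask zeroes the velocity in regions that should stay untouched.
        x += dt * mask * v
    return x

# Usage with constant stand-in flows (a real system would query a video model).
video = np.zeros((2, 4, 4))
mask = np.zeros((2, 4, 4))
mask[:, :, :2] = 1.0  # only the left half of each frame is editable
edited = guided_masked_edit(
    video,
    base_flow=lambda x, t: np.ones_like(x),
    strong_flow=lambda x, t: 2.0 * np.ones_like(x),
    mask=mask,
)
```

Because the mask multiplies the velocity rather than the final result, pixels outside the editable region receive a zero update at every step and stay exactly equal to the input, which is the sense in which non-target regions are preserved in this sketch.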