ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
September 22, 2025
Authors: Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma
cs.AI
Abstract
Training-free video object editing aims to achieve precise object-level
manipulation, including object insertion, swapping, and deletion. However, it
faces significant challenges in maintaining fidelity and temporal consistency.
Existing methods, often designed for U-Net architectures, suffer from two
primary limitations: inaccurate inversion due to first-order solvers, and
contextual conflicts caused by crude "hard" feature replacement. These issues
are exacerbated in Diffusion Transformers (DiTs), where prior layer-selection
heuristics do not transfer, making effective guidance difficult. To
address these limitations, we introduce ContextFlow, a novel training-free
framework for DiT-based video object editing. Specifically, we first employ a
high-order Rectified Flow solver to establish a robust editing foundation. The
core of our framework is Adaptive Context Enrichment (for specifying what to
edit), a mechanism that addresses contextual conflicts. Instead of replacing
features, it enriches the self-attention context by concatenating Key-Value
pairs from parallel reconstruction and editing paths, empowering the model to
dynamically fuse information. Additionally, to determine where to apply this
enrichment (for specifying where to edit), we propose a systematic, data-driven
analysis to identify task-specific vital layers. Based on a novel Guidance
Responsiveness Metric, our method pinpoints the most influential DiT blocks for
different tasks (e.g., insertion, swapping), enabling targeted and highly
effective guidance. Extensive experiments show that ContextFlow significantly
outperforms existing training-free methods and even surpasses several
state-of-the-art training-based approaches, delivering temporally coherent,
high-fidelity results.
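
The abstract attributes inaccurate inversion to the first-order solvers used by prior U-Net methods and instead adopts a high-order Rectified Flow solver. The sketch below contrasts a first-order Euler inversion step with a second-order Heun (predictor-corrector) step; `velocity_model` is a hypothetical stand-in for the DiT's learned velocity field, not the authors' implementation.

```python
# Minimal sketch, assuming `velocity_model(x, t)` returns the learned
# Rectified Flow velocity v(x_t, t) for latents x at time t.
import torch

def euler_inversion_step(velocity_model, x, t, dt):
    # First-order: one velocity evaluation per step (less accurate).
    return x + dt * velocity_model(x, t)

def heun_inversion_step(velocity_model, x, t, dt):
    # Second-order: predict with Euler, then correct by averaging the
    # velocities at the start point and the predicted end point.
    v0 = velocity_model(x, t)
    x_pred = x + dt * v0
    v1 = velocity_model(x_pred, t + dt)
    return x + dt * 0.5 * (v0 + v1)
```

The second evaluation per step roughly doubles the cost but reduces local truncation error from O(dt^2) to O(dt^3), which is the usual motivation for higher-order solvers in inversion.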
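Adaptive Context Enrichment is described as concatenating Key-Value pairs from the parallel reconstruction path into the editing path's self-attention, so the model fuses the two sources dynamically instead of hard-replacing features. A minimal PyTorch sketch of that idea, with assumed (batch, heads, tokens, dim) tensor shapes:

```python
import torch
import torch.nn.functional as F

def enriched_self_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    # "Soft" enrichment: rather than overwriting the editing path's
    # features with reconstruction features ("hard" replacement),
    # extend the attention context along the token axis so each query
    # can attend to both paths and weigh them adaptively.
    k = torch.cat([k_edit, k_recon], dim=2)  # (B, H, 2*tokens, dim)
    v = torch.cat([v_edit, v_recon], dim=2)
    return F.scaled_dot_product_attention(q_edit, k, v)
```

Because attention weights are normalized over the concatenated context, the fusion ratio between edited and reconstructed content is decided per query token rather than fixed in advance.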
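The Guidance Responsiveness Metric itself is not defined in the abstract; the sketch below is only a hypothetical proxy for the data-driven layer analysis it describes, ranking DiT blocks by how strongly enabling enrichment at each block alone changes the output. `run_edit` is an assumed helper, not part of ContextFlow.

```python
import torch

@torch.no_grad()
def rank_blocks_by_responsiveness(run_edit, num_blocks, latents):
    # `run_edit(latents, active_blocks)` is assumed to run the editing
    # pipeline with context enrichment enabled only on `active_blocks`.
    baseline = run_edit(latents, active_blocks=set())
    scores = []
    for i in range(num_blocks):
        out = run_edit(latents, active_blocks={i})
        # Mean absolute change in the output as a responsiveness proxy.
        scores.append((out - baseline).abs().mean().item())
    # Blocks whose guidance most influences the result come first.
    return sorted(range(num_blocks), key=lambda i: scores[i], reverse=True)
```

Run per task (insertion, swapping, deletion), such a probe would yield the task-specific vital layers the abstract refers to.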