ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
September 22, 2025
Authors: Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma
cs.AI
Abstract
Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion, yet it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are exacerbated in Diffusion Transformers (DiTs), where prior layer-selection heuristics do not transfer, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Specifically, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (specifying what to edit), a mechanism that resolves contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to fuse information dynamically. To determine where to apply this enrichment (specifying where to edit), we propose a systematic, data-driven analysis that identifies task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the DiT blocks most influential for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
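
To make the abstract's components concrete, the sketches below illustrate plausible implementations; they are illustrative readings of the abstract, not the authors' code. First, the high-order Rectified Flow solver: the abstract does not name a specific scheme, so this sketch uses a generic second-order Heun (predictor-corrector) integrator for the flow ODE dx/dt = v(x, t), where `velocity` stands in for the DiT's velocity prediction.

```python
import torch

@torch.no_grad()
def heun_rf_inversion(x0, velocity, timesteps):
    """Integrate the rectified-flow ODE dx/dt = v(x, t) from data (t=0)
    toward noise (t=1) with a second-order predictor-corrector step."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        dt = t_next - t_cur
        v1 = velocity(x, t_cur)        # velocity at the current point
        x_pred = x + dt * v1           # first-order Euler predictor
        v2 = velocity(x_pred, t_next)  # velocity at the predicted endpoint
        x = x + 0.5 * dt * (v1 + v2)   # trapezoidal (Heun) corrector
    return x
```

Running the same integrator with the timestep schedule reversed recovers the sampling direction; the second-order correction is what reduces the inversion error that first-order solvers accumulate.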
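Second, Adaptive Context Enrichment can be read as a "soft" alternative to hard feature replacement: inside self-attention, queries from the editing path attend to a concatenation of the editing path's own keys/values and those cached from the parallel reconstruction path. A minimal PyTorch sketch, assuming already-projected Q/K/V tensors of shape (batch, tokens, dim):

```python
import torch
import torch.nn.functional as F

def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon, num_heads=8):
    """Self-attention over an enriched context: editing-path queries attend
    to keys/values from BOTH the editing and reconstruction paths, so the
    model fuses information instead of hard-replacing features."""
    # Concatenate reconstruction KV after editing KV along the token axis.
    k = torch.cat([k_edit, k_recon], dim=1)  # (B, N_edit + N_recon, D)
    v = torch.cat([v_edit, v_recon], dim=1)
    B, Nq, D = q_edit.shape
    h, d = num_heads, D // num_heads
    # Split heads: (B, heads, tokens, head_dim).
    q = q_edit.view(B, Nq, h, d).transpose(1, 2)
    k = k.view(B, -1, h, d).transpose(1, 2)
    v = v.view(B, -1, h, d).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)  # fused softmax attention
    return out.transpose(1, 2).reshape(B, Nq, D)
```

Because the attention softmax normalizes over the enriched token set, reconstruction features are weighed in rather than overwriting editing features, which is the abstract's stated remedy for contextual conflicts.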
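Finally, the abstract mentions, but does not define, the Guidance Responsiveness Metric used to select task-specific vital layers. The sketch below shows only a plausible surrounding search procedure: enable enrichment at one DiT block at a time and rank blocks by how strongly the output responds. The `enrich_blocks` argument, the `num_blocks` attribute, and `score_fn` are all hypothetical placeholders, with `score_fn` standing in for the paper's metric.

```python
import torch

@torch.no_grad()
def rank_vital_blocks(model, sample, score_fn):
    """Probe each DiT block in isolation: apply context enrichment only at
    block `idx` and score the change against the un-enriched baseline.
    Blocks with the largest response are candidates for targeted guidance."""
    baseline = model(sample, enrich_blocks=set())        # hypothetical API
    scores = {}
    for idx in range(model.num_blocks):                  # hypothetical attribute
        probed = model(sample, enrich_blocks={idx})
        scores[idx] = score_fn(probed, baseline)         # stand-in for the metric
    return sorted(scores, key=scores.get, reverse=True)  # most responsive first
```

Per the abstract, the resulting ranking is task-dependent (insertion and swapping respond most at different blocks), so such a probe would be run once per editing task.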