ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
September 22, 2025
Authors: Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma
cs.AI
Abstract
Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion, yet it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are exacerbated in Diffusion Transformers (DiTs), where prior layer-selection heuristics do not transfer, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Specifically, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (specifying what to edit), a mechanism that resolves contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to fuse information dynamically. To determine where to apply this enrichment (specifying where to edit), we propose a systematic, data-driven analysis that identifies task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the DiT blocks most influential for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
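
To make the abstract's components concrete, the sketches below illustrate plausible implementations; they are illustrative readings of the abstract, not the authors' code. First, the high-order Rectified Flow solver: the abstract does not name a specific scheme, so this sketch uses a generic second-order Heun (predictor-corrector) integrator for the flow ODE dx/dt = v(x, t), where `velocity` stands in for the DiT's velocity prediction.

```python
import torch

@torch.no_grad()
def heun_rf_inversion(x0, velocity, timesteps):
    """Integrate the rectified-flow ODE dx/dt = v(x, t) from data (t=0)
    toward noise (t=1) with a second-order predictor-corrector step."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        dt = t_next - t_cur
        v1 = velocity(x, t_cur)        # velocity at the current point
        x_pred = x + dt * v1           # first-order Euler predictor
        v2 = velocity(x_pred, t_next)  # velocity at the predicted endpoint
        x = x + 0.5 * dt * (v1 + v2)   # trapezoidal (Heun) corrector
    return x
```

Running the same integrator with the timestep schedule reversed recovers the sampling direction; the second-order correction is what reduces the inversion error that first-order solvers accumulate.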
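Second, Adaptive Context Enrichment can be read as a "soft" alternative to hard feature replacement: inside self-attention, queries from the editing path attend to a concatenation of the editing path's own keys/values and those cached from the parallel reconstruction path. A minimal PyTorch sketch, assuming already-projected Q/K/V tensors of shape (batch, tokens, dim):

```python
import torch
import torch.nn.functional as F

def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon, num_heads=8):
    """Self-attention over an enriched context: editing-path queries attend
    to keys/values from BOTH the editing and reconstruction paths, so the
    model fuses information instead of hard-replacing features."""
    # Concatenate reconstruction KV after editing KV along the token axis.
    k = torch.cat([k_edit, k_recon], dim=1)  # (B, N_edit + N_recon, D)
    v = torch.cat([v_edit, v_recon], dim=1)
    B, Nq, D = q_edit.shape
    h, d = num_heads, D // num_heads
    # Split heads: (B, heads, tokens, head_dim).
    q = q_edit.view(B, Nq, h, d).transpose(1, 2)
    k = k.view(B, -1, h, d).transpose(1, 2)
    v = v.view(B, -1, h, d).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)  # fused softmax attention
    return out.transpose(1, 2).reshape(B, Nq, D)
```

Because the attention softmax normalizes over the enriched token set, reconstruction features are weighed in rather than overwriting editing features, which is the abstract's stated remedy for contextual conflicts.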
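Finally, the abstract mentions, but does not define, the Guidance Responsiveness Metric used to select task-specific vital layers. The sketch below shows only a plausible surrounding search procedure: enable enrichment at one DiT block at a time and rank blocks by how strongly the output responds. The `enrich_blocks` argument, the `num_blocks` attribute, and `score_fn` are all hypothetical placeholders, with `score_fn` standing in for the paper's metric.

```python
import torch

@torch.no_grad()
def rank_vital_blocks(model, sample, score_fn):
    """Probe each DiT block in isolation: apply context enrichment only at
    block `idx` and score the change against the un-enriched baseline.
    Blocks with the largest response are candidates for targeted guidance."""
    baseline = model(sample, enrich_blocks=set())        # hypothetical API
    scores = {}
    for idx in range(model.num_blocks):                  # hypothetical attribute
        probed = model(sample, enrich_blocks={idx})
        scores[idx] = score_fn(probed, baseline)         # stand-in for the metric
    return sorted(scores, key=scores.get, reverse=True)  # most responsive first
```

Per the abstract, the resulting ranking is task-dependent (insertion and swapping respond most at different blocks), so such a probe would be run once per editing task.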