ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Abstract

Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more acute in Diffusion Transformers (DiTs), where prior layer-selection heuristics are unsuitable and effective guidance is therefore hard to apply. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Specifically, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, allowing the model to fuse information dynamically. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
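As an illustration of the Key-Value concatenation idea described above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes single-head attention on tiny lists of vectors; the function names (`attention`, `enriched_attention`) and the toy inputs are hypothetical and chosen only for this example. Queries from the editing path attend over the keys and values of both the editing path and the reconstruction path, so the softmax weights decide how much each path contributes instead of one set of features overwriting the other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Editing-path queries attend over editing AND reconstruction tokens.

    Assumption for this sketch: enrichment is a concatenation along the
    token axis (list concatenation here), with no feature replacement.
    """
    return attention(q_edit, k_edit + k_recon, v_edit + v_recon)


# Hypothetical toy inputs: two editing tokens, one reconstruction token.
q_edit = [[1.0, 0.0], [0.0, 1.0]]
k_edit = [[1.0, 0.0], [0.0, 1.0]]
v_edit = [[1.0, 0.0], [0.0, 1.0]]
k_recon = [[2.0, 0.0]]
v_recon = [[5.0, 5.0]]

plain = attention(q_edit, k_edit, v_edit)
enriched = enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon)
print("plain   :", plain)
print("enriched:", enriched)
```

In this toy setup the first query aligns with the reconstruction key, so its output moves toward the reconstruction value, while the second query draws less from it. With an empty reconstruction path the function reduces to ordinary self-attention, which is one way to read the contrast with "hard" replacement: the original editing context is never discarded.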
