ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Abstract

Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are exacerbated in Diffusion Transformers (DiTs), where prior layer-selection heuristics are unsuitable, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Specifically, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, allowing the model to fuse information dynamically. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
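
The Key-Value concatenation idea described in the abstract can be illustrated with a minimal single-head NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`attention`, `enriched_attention`), the tensor shapes, and the random inputs are all hypothetical, and the sketch omits multi-head projections, the layer selection step, and the Rectified Flow solver.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    # Hypothetical helper: rather than overwriting the editing path's
    # features with those of the reconstruction path ("hard" replacement),
    # concatenate the Key-Value pairs of both paths along the token axis.
    # The editing-path queries then attend over the joint context.
    k = np.concatenate([k_edit, k_recon], axis=0)
    v = np.concatenate([v_edit, v_recon], axis=0)
    return attention(q_edit, k, v)


# Toy inputs (assumed shapes): 4 tokens, feature dimension 8.
rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
q_edit, k_edit, v_edit = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
k_recon, v_recon = (rng.standard_normal((n_tokens, dim)) for _ in range(2))

out = enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon)
```

In this sketch the output keeps the shape of the editing-path queries, while the softmax decides per token how much weight goes to the reconstruction context versus the editing context. That is one way to read "dynamically fuse information" in the abstract.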
