CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

Abstract

Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, and therefore suffer from task interference and poor adaptation to heterogeneous demands (e.g., local vs. global, semantic vs. photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities. The result is artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts (Text, Mask, Reference, and Base) based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module then fuses the expert outputs, coherently integrating semantic, spatial, and stylistic information into the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, underscoring the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
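
To make the sparse top-K routing and mixture step concrete, the following is a minimal pure-Python sketch of a generic mixture-of-experts gate over four experts. It is not the CARE-Edit implementation: the names `route_top_k` and `mix_experts`, the toy expert functions, and the choice to renormalize the gate weights with a softmax over the selected logits are all assumptions made for illustration. The router logits are taken as given here, whereas the abstract describes a latent-attention router that derives them from the multi-modal conditions and the diffusion timestep.

```python
import math

# Expert order is an assumption for this sketch; the abstract only names the four experts.
EXPERTS = ("text", "mask", "reference", "base")


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns {expert_index: gate_weight}; the weights are a softmax over
    only the selected logits, so they sum to 1 (an assumed convention).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def mix_experts(token, logits, experts, k=2):
    """Run only the selected experts on the token and sum their gated outputs."""
    gates = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, weight in gates.items():
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, gates


# Toy experts (hypothetical): each one just rescales the token.
toy_experts = [
    lambda x: [2.0 * v for v in x],  # "text"
    lambda x: [0.0 * v for v in x],  # "mask"
    lambda x: [4.0 * v for v in x],  # "reference"
    lambda x: list(x),               # "base"
]

# Hypothetical router logits for a single token, one score per expert.
demo_out, demo_gates = mix_experts([1.0, 2.0], [2.0, 0.5, 2.0, -1.0], toy_experts, k=2)
print({EXPERTS[i]: w for i, w in demo_gates.items()}, demo_out)
```

Because only the top-K experts are evaluated per token, the compute cost scales with K rather than with the number of experts, which is the usual motivation for sparse gating.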
