CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
March 9, 2026
Authors: Yucheng Wang, Zedong Wang, Yuetong Wu, Yue Ma, Dan Xu
cs.AI
Abstract
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, and therefore suffer from task interference and poor adaptation to heterogeneous demands (e.g., local vs. global, semantic vs. photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities. This results in artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts (Text, Mask, Reference, and Base) based on multi-modal conditions and diffusion timesteps. The method proceeds in three steps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; and (iii) a Latent Mixture module then fuses the expert outputs, coherently integrating semantic, spatial, and stylistic information into the base image. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, showing the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
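To make the routing idea concrete, the following is a minimal sketch of sparse top-K expert selection followed by a weighted mixture of expert outputs for a single token. It is not the authors' implementation: the function names (`top_k_route`, `mix_token`), the toy experts, and the hand-written router logits are all hypothetical. In the method described above, the logits would come from a latent-attention router conditioned on the multi-modal inputs and the diffusion timestep, which this sketch does not model.

```python
import math

# Expert order assumed for illustration only: Text, Mask, Reference, Base.
EXPERT_NAMES = ["text", "mask", "reference", "base"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k=2):
    """Keep the k largest router logits and renormalize them with a softmax.

    Returns a dict mapping expert index -> mixing weight (weights sum to 1).
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    selected = order[:k]
    weights = softmax([logits[i] for i in selected])
    return dict(zip(selected, weights))


def mix_token(token, logits, experts, k=2):
    """Run only the selected experts on one token and sum their weighted outputs."""
    routes = top_k_route(logits, k)
    mixed = [0.0] * len(token)
    for idx, weight in routes.items():
        expert_out = experts[idx](token)
        mixed = [m + weight * v for m, v in zip(mixed, expert_out)]
    return mixed, routes


# Toy stand-ins for the four experts (hypothetical, chosen so results are easy to check).
toy_experts = [
    lambda t: [2.0 * v for v in t],  # "text" expert
    lambda t: [0.0 for _ in t],      # "mask" expert
    lambda t: [v + 1.0 for v in t],  # "reference" expert
    lambda t: list(t),               # "base" expert
]

token = [1.0, -1.0]
router_logits = [3.0, -2.0, 0.5, 3.0]  # hand-picked; a real router would predict these
mixed, routes = mix_token(token, router_logits, toy_experts, k=2)
selected_names = sorted(EXPERT_NAMES[i] for i in routes)
```

With these hand-picked logits, only the "text" and "base" experts are evaluated for the token, which is the sense in which top-K selection allocates computation sparsely. A per-token, per-timestep router of this shape could in principle down-weight a conflicting modality rather than always adding every condition in.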