CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

Abstract

Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, and so suffer from task interference and poor adaptation to heterogeneous demands (e.g., local vs. global, semantic vs. photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities. This results in artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts (Text, Mask, Reference, and Base) based on multi-modal conditions and diffusion timesteps. The method has three components: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module then fuses the expert outputs, coherently integrating semantic, spatial, and stylistic information into the base image. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, highlighting the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
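
To make the routing idea in the abstract concrete, the following is a minimal sketch of sparse top-K gating over four experts followed by a weighted fusion of their outputs. It is an illustration under stated assumptions, not the CARE-Edit implementation: the names `top_k_gate`, `mix_experts`, and `TOY_EXPERTS` are hypothetical, the router logits are supplied by hand (in the paper they would come from the latent-attention router, conditioned on the inputs and the diffusion timestep), and the experts are toy functions on plain Python lists rather than network modules.

```python
import math

# Expert order used throughout this sketch (labels taken from the abstract).
EXPERT_NAMES = ["text", "mask", "reference", "base"]


def top_k_gate(logits, k=2):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k highest-scoring experts (ties keep the original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def mix_experts(token, logits, experts, k=2):
    """Fuse expert outputs for one token as a gate-weighted sum."""
    weights = top_k_gate(logits, k)
    out = [0.0] * len(token)
    for weight, expert in zip(weights, experts):
        if weight == 0.0:
            continue  # experts with zero weight are never evaluated
        result = expert(token)
        out = [o + weight * r for o, r in zip(out, result)]
    return out, weights


# Hypothetical stand-ins for the four experts (assumption: same output size).
TOY_EXPERTS = [
    lambda t: [2.0 * v for v in t],   # "text" expert
    lambda t: [0.0 for _ in t],       # "mask" expert
    lambda t: list(t),                # "reference" expert
    lambda t: [-v for v in t],        # "base" expert
]

if __name__ == "__main__":
    fused, gate = mix_experts([1.0, 2.0], [2.0, 0.5, 2.0, -1.0], TOY_EXPERTS, k=2)
    print(dict(zip(EXPERT_NAMES, gate)), fused)
```

Because only the selected experts are evaluated, the cost per token grows with K rather than with the total number of experts, which is the usual motivation for sparse selection over a dense mixture.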
