Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model

Abstract

Balancing fidelity and editability is essential in text-based image editing (TIE); when the balance fails, the result is typically over-editing or under-editing. Existing methods usually rely on attention injection for structure preservation and on the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack an explicit, unified mechanism for balancing these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to balance fidelity and editability within a unified framework. Instead of injecting attention directly, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint that strengthens text alignment for improved editability. Applying both constraints simultaneously, however, can cause gradient conflicts, in which one constraint dominates and the result is over- or under-edited. To address this, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of the two constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments across various editing tasks show that our approach achieves a robust balance between structure preservation and text alignment and outperforms other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
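The interplay between the two constraints and the scheduler can be illustrated with a toy sketch. This is not the UnifyEdit implementation: the abstract gives no formulas, so the quadratic losses below merely stand in for the SA preservation and CA alignment constraints, and the weighting rule in `adaptive_weights` is a hypothetical scheduler chosen only to show how per-step weights can keep either gradient from dominating. All function names and parameters are assumptions made for illustration.

```python
def sa_preservation_loss(latent, source):
    # Stand-in for the SA preservation constraint: stay close to the source.
    return sum((z - s) ** 2 for z, s in zip(latent, source))


def ca_alignment_loss(latent, target):
    # Stand-in for the CA alignment constraint: move toward the text target.
    return sum((z - t) ** 2 for z, t in zip(latent, target))


def grad_sq_dist(latent, anchor):
    # Gradient of sum((z - a)^2) with respect to z.
    return [2.0 * (z - a) for z, a in zip(latent, anchor)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adaptive_weights(step, num_steps, loss_sa, loss_ca):
    # Hypothetical scheduler (an assumption, not the paper's rule):
    # a time-dependent base weight blended with the relative size of each
    # loss, so the constraint that is currently violated more gains weight.
    progress = step / max(num_steps - 1, 1)
    w_sa, w_ca = progress, 1.0 - progress
    total = loss_sa + loss_ca
    if total > 0.0:
        w_sa = 0.5 * w_sa + 0.5 * loss_sa / total
        w_ca = 0.5 * w_ca + 0.5 * loss_ca / total
    return w_sa, w_ca  # the two weights always sum to 1


def optimize_latent(source, target, num_steps=50, lr=0.1):
    latent = list(source)  # start from the source latent
    conflicts = 0
    for step in range(num_steps):
        g_sa = grad_sq_dist(latent, source)
        g_ca = grad_sq_dist(latent, target)
        if dot(g_sa, g_ca) < 0.0:
            conflicts += 1  # the two gradients point in opposing directions
        w_sa, w_ca = adaptive_weights(
            step,
            num_steps,
            sa_preservation_loss(latent, source),
            ca_alignment_loss(latent, target),
        )
        # Weighted gradient step on the latent itself; no model weights change.
        latent = [
            z - lr * (w_sa * a + w_ca * b)
            for z, a, b in zip(latent, g_sa, g_ca)
        ]
    return latent, conflicts


if __name__ == "__main__":
    final_latent, n_conflicts = optimize_latent([0.0, 0.0], [1.0, 1.0])
    print(final_latent, n_conflicts)
```

In this toy setting the two gradients oppose each other at almost every step, and the weighted update keeps the latent between the source and the target rather than collapsing onto either one, which is the qualitative behavior the abstract attributes to balancing the constraints.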
