Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

Abstract

Exemplar-based image editing applies a transformation defined by a source-target image pair to a new query image. Existing methods rely on a pair-of-pairs supervision paradigm, which requires two image pairs sharing the same edit semantics to learn the target transformation. This constraint makes training data difficult to curate at scale and limits generalization across diverse edit types. We propose Delta-Adapter, a method that learns transferable editing semantics under single-pair supervision and requires no textual guidance. Rather than directly exposing the exemplar pair to the model, we use a pre-trained vision encoder to extract a semantic delta that encodes the visual transformation between the two images. This semantic delta is injected into a pre-trained image editing model via a Perceiver-based adapter. Because the target image is never directly visible to the model, it can serve as the prediction target, which enables single-pair supervision without additional exemplar pairs. This formulation allows us to train on existing large-scale editing datasets. To further promote faithful transformation transfer, we introduce a semantic delta consistency loss that aligns the semantic change of the generated output with the ground-truth semantic delta extracted from the exemplar pair. Extensive experiments demonstrate that Delta-Adapter consistently improves both editing accuracy and content consistency over four strong baselines on seen editing tasks, while also generalizing more effectively to unseen editing tasks. Code will be available at https://delta-adapter.github.io.
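To make the semantic delta and the consistency loss concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract specifies neither the encoder nor the exact form of the delta or the loss. Here the delta is assumed to be a feature-space difference, the loss is assumed to be one minus cosine similarity, and `encode` is a hypothetical stand-in (a fixed random linear projection) for the pre-trained vision encoder. All function and constant names are illustrative.

```python
import numpy as np

# Hypothetical stand-in for a pre-trained vision encoder: a fixed random
# linear projection of the flattened image. A real system would use a
# frozen pre-trained model; the abstract does not name which one.
_rng = np.random.default_rng(0)
IMG_SHAPE = (8, 8, 3)
EMBED_DIM = 16
_W = _rng.standard_normal((int(np.prod(IMG_SHAPE)), EMBED_DIM))


def encode(image: np.ndarray) -> np.ndarray:
    """Map an image of shape IMG_SHAPE to an EMBED_DIM feature vector."""
    return image.reshape(-1) @ _W


def semantic_delta(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Assumed form of the semantic delta: a feature-space difference."""
    return encode(target) - encode(source)


def delta_consistency_loss(query, output, exemplar_src, exemplar_tgt, eps=1e-8):
    """Assumed loss: 1 - cosine similarity between the generated and exemplar deltas."""
    d_gen = semantic_delta(query, output)                # change the model produced
    d_ref = semantic_delta(exemplar_src, exemplar_tgt)   # change the exemplar shows
    cos = d_gen @ d_ref / (np.linalg.norm(d_gen) * np.linalg.norm(d_ref) + eps)
    return 1.0 - cos
```

Under single-pair supervision as described above, the query would be the pair's source image and the output would be compared against that same pair's target, so the same function covers training with `query = exemplar_src`. In this toy setup the encoder is linear, so an output that adds the exemplar's pixel offset to the query yields a loss near zero; that property belongs to the stand-in encoder only and should not be expected from a real one.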
