RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Abstract

Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract content-aware editing intent and transfer it to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT)-based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.
