Discriminative Diffusion Models as Few-shot Vision and Language Learners

Abstract

Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tunes the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
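
To make the idea of scoring an image-text pair from cross-attention more concrete, here is a minimal NumPy sketch. It is not the DSD implementation: the function names (`attention_logits`, `match_score`, `rank_captions`), the toy feature arrays, and the log-sum-exp pooling used to reduce the attention map to one number are all assumptions chosen for illustration, and no Stable Diffusion weights or prompt learning are involved.

```python
import numpy as np

def attention_logits(patch_queries, token_keys):
    """Scaled dot-product logits between image patches and text tokens.

    patch_queries: (num_patches, d) array, stand-in for image-side queries.
    token_keys:    (num_tokens, d) array, stand-in for text-side keys.
    Returns a (num_patches, num_tokens) array of pre-softmax logits.
    """
    d = patch_queries.shape[-1]
    return patch_queries @ token_keys.T / np.sqrt(d)

def match_score(patch_queries, token_keys):
    """Pool the whole attention map into one scalar score.

    Log-sum-exp pooling is an assumption made for this sketch; it is
    computed in a numerically stable way by subtracting the maximum.
    """
    logits = attention_logits(patch_queries, token_keys)
    m = logits.max()
    return float(m + np.log(np.exp(logits - m).sum()))

def rank_captions(patch_queries, caption_keys):
    """Return the index of the best-scoring caption and all scores."""
    scores = [match_score(patch_queries, keys) for keys in caption_keys]
    return int(np.argmax(scores)), scores

# Toy data (hypothetical): four image patches in a 4-dimensional space.
patches = np.eye(4)
aligned_caption = 3.0 * np.eye(4)[:2]      # tokens aligned with two patches
misaligned_caption = -3.0 * np.eye(4)[:2]  # tokens pointing the opposite way

best, scores = rank_captions(patches, [aligned_caption, misaligned_caption])
print(best, scores)
```

In this toy setup the caption whose token features align with the image patches receives the higher pooled score. In a few-shot setting, a score of this kind would be the quantity that a small set of learnable parameters is tuned against.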

