
Discriminative Diffusion Models as Few-shot Vision and Language Learners

May 18, 2023
作者: Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
cs.AI

Abstract

Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tune the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks with superior results on few-shot image-text matching.
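The core idea — reading a cross-attention map between image features (queries) and text-token features (keys) and reducing it to a single image-text matching score — can be illustrated with a toy sketch. This is not the paper's actual DSD implementation; the feature shapes, the softmax attention, and the max-then-mean pooling are illustrative assumptions.

```python
import numpy as np

def cross_attention(image_feats, text_feats):
    """Toy cross-attention: image patches attend over text tokens.

    image_feats: (n_patches, d) queries from the image latents.
    text_feats:  (n_tokens, d) keys from the text-prompt encoding.
    Returns an (n_patches, n_tokens) row-stochastic attention map.
    """
    logits = image_feats @ text_feats.T / np.sqrt(image_feats.shape[1])
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    return attn / attn.sum(axis=1, keepdims=True)

def matching_score(image_feats, text_feats):
    """Collapse the attention map to one scalar: for each patch take its
    strongest text-token alignment, then average over patches.
    (One plausible reduction; the paper's exact scoring may differ.)"""
    attn = cross_attention(image_feats, text_feats)
    return float(attn.max(axis=1).mean())

# Illustrative usage: compare an aligned vs. a mismatched caption encoding.
rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))                       # 16 patches, dim 8
txt_aligned = img[:4] + 0.01 * rng.normal(size=(4, 8))  # tokens echoing patches
txt_random = rng.normal(size=(4, 8))                    # unrelated tokens
print(matching_score(img, txt_aligned), matching_score(img, txt_random))
```

In the actual method, these attention maps come from the frozen U-Net's cross-attention layers inside Stable Diffusion, and only prompt parameters are tuned in the few-shot setting; the sketch above only shows how such a map can be turned into a discriminative score.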