Discriminative Diffusion Models as Few-shot Vision and Language Learners

May 18, 2023
Authors: Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
cs.AI

Abstract

Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, achieving superior results on few-shot image-text matching.
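
To make the core readout concrete, here is a minimal sketch of pulling image-to-text cross-attention maps out of a pre-trained Stable Diffusion UNet with the Hugging Face diffusers library and collapsing them into a matching score. The `StoreCrossAttnProcessor` class, the `itm_score` helper, the single noise level `t=500`, and the scoring rule (attention mass placed on non-padding caption tokens) are illustrative assumptions of ours, not the authors' DSD objective or their attention-based prompt learning.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor


class StoreCrossAttnProcessor:
    """Classic (non-fused) attention computation that also records the
    attention probabilities for inspection after the forward pass."""

    def __init__(self, store):
        self.store = store  # shared list collecting one map per layer call

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, temb=None, **kwargs):
        query = attn.to_q(hidden_states)
        if encoder_hidden_states is None:  # would be self-attention
            encoder_hidden_states = hidden_states
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)
        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)
        probs = attn.get_attention_scores(query, key, attention_mask)
        self.store.append(probs.detach())  # (batch*heads, patches, 77 tokens)
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        return attn.to_out[1](attn.to_out[0](out))  # linear proj, then dropout


@torch.no_grad()
def itm_score(pipe, image, caption, t=500):
    """Hypothetical matching score: average cross-attention mass that image
    patches place on the real (non-padding) caption tokens at noise level t."""
    store = []
    procs = {}
    for name in pipe.unet.attn_processors:
        # "attn2" modules are the text cross-attention layers in the SD UNet
        procs[name] = StoreCrossAttnProcessor(store) if "attn2" in name else AttnProcessor()
    pipe.unet.set_attn_processor(procs)

    # image: a [1, 3, 512, 512] float tensor in [-1, 1] (assumed preprocessed)
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    timestep = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), timestep)

    enc = pipe.tokenizer(caption, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt")
    text_emb = pipe.text_encoder(enc.input_ids.to(latents.device))[0]
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)  # fills `store`

    # Fraction of each patch's attention landing on real caption tokens;
    # padding positions soak up the rest, so this varies across pairs.
    mask = enc.attention_mask[0].bool().to(latents.device)
    return torch.stack([p[..., mask].sum(-1).mean() for p in store]).mean().item()


pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# score = itm_score(pipe, image_tensor, "a photo of a dog on the beach")
```

DSD itself goes further by fine-tuning the model with attention-based prompt learning in the few-shot setting; the sketch above only shows how the raw cross-attention signal can be read out of the frozen model.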