Discriminative Diffusion Models as Few-shot Vision and Language Learners

Abstract

Diffusion models, such as Stable Diffusion, have shown remarkable performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with the fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
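As a rough illustration of how a cross-attention score could serve as an image-text matching signal, the sketch below computes scaled dot-product scores between image-position features and text-token features, pools them into one scalar per caption, and picks the highest-scoring caption. This is not the DSD implementation: the function names, the toy feature vectors, and the LogSumExp pooling are assumptions made for illustration, and the attention-based prompt learning step is omitted entirely.

```python
import math


def attention_logits(image_feats, text_feats):
    """Scaled dot-product scores: one row per image position, one column per text token."""
    d = len(text_feats[0])
    return [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d) for k in text_feats]
        for q in image_feats
    ]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a flat list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def match_score(image_feats, text_feats):
    """Pool all image-text scores into a single scalar (assumed pooling, for illustration)."""
    logits = attention_logits(image_feats, text_feats)
    return logsumexp([v for row in logits for v in row])


def best_caption(image_feats, captions):
    """Return the key of the caption whose features score highest against the image."""
    scores = {name: match_score(image_feats, feats) for name, feats in captions.items()}
    return max(scores, key=scores.get)


# Toy, hand-written features (hypothetical): two image positions, one token per caption.
image = [[1.0, 0.0], [0.9, 0.1]]
captions = {
    "aligned": [[1.0, 0.0]],
    "unrelated": [[0.0, 1.0]],
}
print(best_caption(image, captions))
```

In a real setting the features would come from the denoising network's cross-attention layers rather than hand-written vectors, and because LogSumExp grows with the number of pooled entries, captions of different lengths would need some form of normalization before their scores are compared.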

