An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning

Abstract

Textual Inversion, a prompt learning method, learns a single embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable, as our empirical tests confirm. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), in which multiple new "words" are learned simultaneously from a single sentence-image pair. To improve the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate the method through image generation, editing, and attention visualisation on diverse images. Extensive quantitative comparisons demonstrate that our method learns more semantically disentangled concepts with stronger word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored to this new task of learning object-level concepts.
