An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
October 18, 2023
Authors: Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
cs.AI
Abstract
Textual Inversion, a prompt learning method, learns a singular embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new "words" are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts.
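To make the PromptCL idea concrete, below is a minimal, illustrative sketch of an InfoNCE-style contrastive loss that pulls together embeddings belonging to the same concept and pushes apart embeddings of different concepts. This is not the authors' implementation; the function name, the toy embeddings, and the temperature value are all assumptions for illustration.

```python
import numpy as np

def contrastive_loss(embeddings, labels, temperature=0.07):
    """InfoNCE-style loss: embeddings sharing a concept label are treated as
    positives; embeddings of other concepts act as negatives."""
    # L2-normalise so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # log-sum-exp over all other samples forms the softmax denominator
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        for j in positives:
            loss += -(sim[i, j] - log_denom)
            count += 1
    return loss / count

rng = np.random.default_rng(0)
# two toy "concepts", four noisy embeddings each (stand-ins for learned prompts)
a = rng.normal(size=(4, 8)) * 0.1 + np.array([1.0] + [0.0] * 7)
b = rng.normal(size=(4, 8)) * 0.1 + np.array([0.0, 1.0] + [0.0] * 6)
emb = np.vstack([a, b])
labels = [0] * 4 + [1] * 4
print(contrastive_loss(emb, labels))
```

Minimising such a loss during prompt learning encourages the embeddings of different new "words" to occupy distinct regions of the embedding space, which is the disentanglement effect the abstract attributes to PromptCL.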