An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
October 18, 2023
Authors: Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
cs.AI
Abstract
Textual Inversion, a prompt learning method, learns a singular embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new "words" are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts.
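To make the PromptCL idea concrete, below is a minimal, illustrative sketch of an InfoNCE-style contrastive loss that pulls together embeddings belonging to the same concept and pushes apart embeddings of different concepts. This is not the authors' implementation; the function name, the toy embeddings, and the temperature value are all assumptions for illustration.

```python
import numpy as np

def contrastive_loss(embeddings, labels, temperature=0.07):
    """InfoNCE-style loss: embeddings sharing a concept label are treated as
    positives; embeddings of other concepts act as negatives."""
    # L2-normalise so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # log-sum-exp over all other samples forms the softmax denominator
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        for j in positives:
            loss += -(sim[i, j] - log_denom)
            count += 1
    return loss / count

rng = np.random.default_rng(0)
# two toy "concepts", four noisy embeddings each (stand-ins for learned prompts)
a = rng.normal(size=(4, 8)) * 0.1 + np.array([1.0] + [0.0] * 7)
b = rng.normal(size=(4, 8)) * 0.1 + np.array([0.0, 1.0] + [0.0] * 6)
emb = np.vstack([a, b])
labels = [0] * 4 + [1] * 4
print(contrastive_loss(emb, labels))
```

Minimising such a loss during prompt learning encourages the embeddings of different new "words" to occupy distinct regions of the embedding space, which is the disentanglement effect the abstract attributes to PromptCL.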