
An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning

October 18, 2023
作者: Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
cs.AI

Abstract

Textual Inversion, a prompt learning method, learns a singular embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new "words" are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts.