MyVLM: Personalizing VLMs for User-Specific Queries
March 21, 2024
Authors: Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
cs.AI
Abstract
Recent large-scale vision-language models (VLMs) have demonstrated remarkable
capabilities in understanding and generating textual descriptions for visual
content. However, these models lack an understanding of user-specific concepts.
In this work, we take a first step toward the personalization of VLMs, enabling
them to learn and reason over user-provided concepts. For example, we explore
whether these models can learn to recognize you in an image and communicate
what you are doing, tailoring the model to reflect your personal experiences
and relationships. To effectively recognize a variety of user-specific
concepts, we augment the VLM with external concept heads that function as
toggles for the model, enabling the VLM to identify the presence of specific
target concepts in a given image. Having recognized the concept, we learn a new
concept embedding in the intermediate feature space of the VLM. This embedding
is tasked with guiding the language model to naturally integrate the target
concept in its generated response. We apply our technique to BLIP-2 and LLaVA
for personalized image captioning and further show its applicability for
personalized visual question-answering. Our experiments demonstrate our ability
to generalize to unseen images of learned concepts while preserving the model
behavior on unrelated inputs.
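To make the mechanism concrete, below is a minimal PyTorch sketch of the two components the abstract describes: an external concept head that signals whether the user's concept appears in an image, and a single learned concept embedding appended to the visual features the frozen language model attends to. This is an illustrative reconstruction based only on the abstract, not the authors' implementation; the class names `ConceptHead` and `ConceptEmbedding`, the tensor shapes, and the thresholding logic are all assumptions.

```python
# Illustrative sketch only (not MyVLM's released code). Assumes a frozen
# BLIP-2/LLaVA-style pipeline in which a sequence of visual tokens is fed to a
# frozen language model; only the tiny modules below would be trained per concept.
import torch
import torch.nn as nn


class ConceptHead(nn.Module):
    """External binary classifier: does the target concept appear in the image?"""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, 1)

    def forward(self, pooled_image_features: torch.Tensor) -> torch.Tensor:
        # (batch, feature_dim) -> (batch,) probability that the concept is present.
        return torch.sigmoid(self.classifier(pooled_image_features)).squeeze(-1)


class ConceptEmbedding(nn.Module):
    """A single trainable token injected into the VLM's intermediate feature space."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.token = nn.Parameter(0.02 * torch.randn(1, 1, hidden_dim))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, hidden_dim), e.g. Q-Former outputs.
        # Appending the concept token guides the frozen LM to mention the concept.
        batch = visual_tokens.shape[0]
        return torch.cat([visual_tokens, self.token.expand(batch, -1, -1)], dim=1)


if __name__ == "__main__":
    head, concept = ConceptHead(), ConceptEmbedding()

    pooled = torch.randn(1, 512)             # stand-in for a pooled image feature
    visual_tokens = torch.randn(1, 32, 768)  # stand-in for intermediate visual tokens

    if head(pooled).item() > 0.5:               # the concept head acts as a toggle
        visual_tokens = concept(visual_tokens)  # inject the learned concept embedding

    print(visual_tokens.shape)  # (1, 33, 768) if detected, else (1, 32, 768)
```

Under this reading, training would optimize only the concept head and the concept embedding on the user's handful of images while the VLM stays frozen; the abstract does not specify the exact feature spaces or losses, so those details are left open here.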