MyVLM: Personalizing VLMs for User-Specific Queries
March 21, 2024
Authors: Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
cs.AI
Abstract
Recent large-scale vision-language models (VLMs) have demonstrated remarkable
capabilities in understanding and generating textual descriptions for visual
content. However, these models lack an understanding of user-specific concepts.
In this work, we take a first step toward the personalization of VLMs, enabling
them to learn and reason over user-provided concepts. For example, we explore
whether these models can learn to recognize you in an image and communicate
what you are doing, tailoring the model to reflect your personal experiences
and relationships. To effectively recognize a variety of user-specific
concepts, we augment the VLM with external concept heads that function as
toggles for the model, enabling the VLM to identify the presence of specific
target concepts in a given image. Having recognized the concept, we learn a new
concept embedding in the intermediate feature space of the VLM. This embedding
is tasked with guiding the language model to naturally integrate the target
concept in its generated response. We apply our technique to BLIP-2 and LLaVA
for personalized image captioning and further show its applicability for
personalized visual question-answering. Our experiments demonstrate our ability
to generalize to unseen images of learned concepts while preserving the model
behavior on unrelated inputs.
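To make the described mechanism concrete, the following is a minimal PyTorch sketch, not the authors' released code, of how an external concept head and a learned concept embedding might interact: a linear head acts as a toggle that fires when the target concept is detected in pooled image features, and a single trainable embedding is then appended to the visual tokens passed to the language model. The class names, feature dimensions, and masking logic are illustrative assumptions.

```python
# Hypothetical sketch of a concept head (toggle) plus a learned concept embedding.
import torch
import torch.nn as nn


class ConceptHead(nn.Module):
    """Binary classifier acting as a toggle: is the target concept in the image?"""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, feat_dim) pooled visual features
        return torch.sigmoid(self.classifier(image_features)).squeeze(-1)


class PersonalizedPrefix(nn.Module):
    """Single trainable concept embedding injected alongside the visual tokens."""

    def __init__(self, lm_dim: int = 2560):
        super().__init__()
        self.concept_embedding = nn.Parameter(torch.randn(1, 1, lm_dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor, detected: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, seq_len, lm_dim) tokens fed to the language model
        batch = visual_tokens.shape[0]
        concept = self.concept_embedding.expand(batch, -1, -1)
        # zero out the appended concept token when the head did not fire
        # (a simplification of conditioning only on detected concepts)
        mask = (detected > 0.5).float().view(batch, 1, 1)
        return torch.cat([visual_tokens, concept * mask], dim=1)


if __name__ == "__main__":
    head = ConceptHead()
    prefix = PersonalizedPrefix()
    pooled = torch.randn(2, 768)        # pooled image features (e.g. from a vision encoder)
    tokens = torch.randn(2, 32, 2560)   # visual tokens passed to the language model
    detected = head(pooled)
    augmented = prefix(tokens, detected)
    print(detected.shape, augmented.shape)  # torch.Size([2]) torch.Size([2, 33, 2560])
```

In this reading, only the concept head and the concept embedding would be trained per concept, while the underlying VLM stays frozen, which is consistent with the abstract's claim that behavior on unrelated inputs is preserved.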