Personalized Visual Instruction Tuning
October 9, 2024
Authors: Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang
cs.AI
Abstract
Recent advancements in multimodal large language models (MLLMs) have
demonstrated significant progress; however, these models exhibit a notable
limitation, which we refer to as "face blindness". Specifically, they can
engage in general conversations but fail to conduct personalized dialogues
targeting specific individuals. This deficiency hinders the application of
MLLMs in personalized settings, such as tailored visual assistants on mobile
devices, or domestic robots that need to recognize members of the family. In
this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel
data curation and training framework designed to enable MLLMs to identify
target individuals within an image and engage in personalized and coherent
dialogues. Our approach involves the development of a sophisticated pipeline
that autonomously generates training data containing personalized
conversations. This pipeline leverages the capabilities of various visual
experts, image generation models, and (multi-modal) large language models. To
evaluate the personalized potential of MLLMs, we present a benchmark called
P-Bench, which encompasses various question types with different levels of
difficulty. The experiments demonstrate a substantial personalized performance
enhancement after fine-tuning with our curated dataset.
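As a rough illustration of what the pipeline described above might output, a personalized instruction-tuning sample could pair a reference image of a target individual with a scene image and a dialogue that refers to that individual by name. This sketch is an assumption for clarity only; the field names (`individual`, `reference_image`, `scene_image`, `conversations`) and the helper `personalization_turns` are hypothetical and not the paper's actual schema:

```python
# Hypothetical structure of one personalized visual-instruction sample.
# All field names are illustrative assumptions, not PVIT's real format.
sample = {
    "individual": {
        "name": "Alice",                      # identity the MLLM should learn
        "reference_image": "alice_face.jpg",  # crop showing the target person
    },
    "scene_image": "living_room.jpg",         # image in which the person appears
    "conversations": [
        {"role": "user",
         "content": "<image> What is Alice doing in this picture?"},
        {"role": "assistant",
         "content": "Alice is sitting on the couch reading a book."},
    ],
}

def personalization_turns(s):
    """Count dialogue turns that mention the target individual by name."""
    name = s["individual"]["name"]
    return sum(name in turn["content"] for turn in s["conversations"])

print(personalization_turns(sample))  # both turns mention "Alice", so prints 2
```

Tying each dialogue to a named reference image is what distinguishes this data from generic visual-instruction data: the model must ground the name in the reference crop rather than answer from the scene alone.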