
Personalized Visual Instruction Tuning

October 9, 2024
Authors: Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang
cs.AI

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as "face blindness". Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices, or domestic robots that need to recognize members of the family. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized conversations. This pipeline leverages the capabilities of various visual experts, image generation models, and (multimodal) large language models. To evaluate the personalization potential of MLLMs, we present a benchmark called P-Bench, which encompasses various question types with different levels of difficulty. The experiments demonstrate a substantial personalized performance enhancement after fine-tuning with our curated dataset.
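To make the data-curation idea concrete, the sketch below assembles one personalized instruction-tuning example of the kind the abstract describes: reference images introduce a named individual, and a dialogue about a query image mentions that person. All names, field keys, and the helper function are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of a personalized training sample (field names and the
# helper are illustrative assumptions, not taken from the PVIT paper).

def build_personalized_sample(name, reference_images, query_image, qa_pairs):
    """Assemble one personalized instruction-tuning example.

    name             -- the individual's name, e.g. "Alice"
    reference_images -- paths to images that introduce the person
    query_image      -- path to the scene image the dialogue is about
    qa_pairs         -- list of (question, answer) strings mentioning `name`
    """
    system_prompt = (
        f"The person shown in the reference image(s) is {name}. "
        f"Answer questions about {name} in the query image."
    )
    conversation = []
    for question, answer in qa_pairs:
        conversation.append({"role": "user", "content": question})
        conversation.append({"role": "assistant", "content": answer})
    return {
        "system": system_prompt,
        "references": list(reference_images),
        "query_image": query_image,
        "conversation": conversation,
    }

sample = build_personalized_sample(
    "Alice",
    ["alice_ref.jpg"],
    "party_scene.jpg",
    [("What is Alice doing in the picture?", "Alice is cutting the cake.")],
)
```

In the paper's pipeline, the question-answer pairs and reference crops would come from the visual experts, image generators, and (M)LLMs mentioned above rather than being written by hand.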

