Personalized Visual Instruction Tuning

Abstract

Recent multimodal large language models (MLLMs) have made significant progress; however, they exhibit a notable limitation, which we refer to as "face blindness". Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices, or domestic robots that need to recognize members of the family. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized conversations. This pipeline leverages the capabilities of various visual experts, image generation models, and (multimodal) large language models. To evaluate the personalization potential of MLLMs, we present a benchmark called P-Bench, which encompasses various question types with different levels of difficulty. The experiments demonstrate a substantial improvement in personalization performance after fine-tuning with our curated dataset.