Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Abstract

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in daily life. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the information about the retrieved concepts are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://github.com/Hoar012/RAP-MLLM.
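
The three steps named in the abstract (Remember, Retrieve, Generate) can be illustrated with a small toy sketch. This is not the code from the RAP-MLLM repository: the names `Concept`, `ConceptStore`, `cosine` and `build_prompt` are hypothetical, the two-dimensional vectors stand in for the output of a real multimodal retriever, and the final call to an MLLM is replaced by simply building the prompt text it would receive.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One user-specific entry (hypothetical schema): a name plus free-text info."""
    name: str
    info: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ConceptStore:
    """Toy key-value database: key = embedding vector, value = Concept."""
    entries: list = field(default_factory=list)

    def remember(self, key, concept):
        # Drop any older entry with the same name, so an edit to the
        # database takes effect on the very next retrieval.
        self.entries = [(k, c) for k, c in self.entries if c.name != concept.name]
        self.entries.append((key, concept))

    def retrieve(self, query_key, top_k=1):
        # Rank stored concepts by similarity to the query embedding.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query_key, e[0]),
                        reverse=True)
        return [c for _, c in ranked[:top_k]]


def build_prompt(question, concepts):
    """Prepend retrieved concept info to the user's question (stand-in for the MLLM call)."""
    context = "\n".join(f"<{c.name}>: {c.info}" for c in concepts)
    return f"{context}\nUser: {question}"


# (a) Remember: store two concepts under made-up embeddings.
store = ConceptStore()
store.remember([1.0, 0.0], Concept("bo", "a golden retriever owned by the user"))
store.remember([0.0, 1.0], Concept("mug", "the user's blue coffee mug"))

# (b) Retrieve: the query embedding is assumed to come from the input image.
hits = store.retrieve([0.9, 0.1], top_k=1)

# (c) Generate: a real system would pass this prompt and the image to the MLLM.
prompt = build_prompt("What is in this photo?", hits)

# Concept editing: overwrite the entry; no model weights change.
store.remember([1.0, 0.0], Concept("bo", "a labrador owned by the user"))
edited = store.retrieve([0.9, 0.1], top_k=1)
```

In this sketch, editing a concept is only a database write, which mirrors the abstract's point that personalization can be updated without retraining the model.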
