Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
October 17, 2024
Authors: Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
cs.AI
Abstract
The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in humans' daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a data-collection pipeline and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://github.com/Hoar012/RAP-MLLM.
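The Remember/Retrieve/Generate loop described in the abstract can be sketched in a few lines. The embedder, the similarity measure, the database entries, and the `generate` stub below are all hypothetical placeholders chosen for illustration, not the paper's actual multimodal retriever or MLLM; the sketch only shows how a personalized response is assembled from an editable external key-value store.

```python
import math

def embed(text):
    # Toy stand-in for a multimodal embedder: a character-frequency
    # vector over a-z. The real system would embed images and text.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class ConceptDatabase:
    """(a) Remember: key-value store of user concepts.
    Editing self.entries at any time realizes real-time concept editing."""
    def __init__(self):
        self.entries = {}  # key -> (embedding, info)

    def remember(self, key, description, info):
        self.entries[key] = (embed(description), info)

    def retrieve(self, query, top_k=1):
        """(b) Retrieve: rank stored concepts by similarity to the query."""
        q = embed(query)
        ranked = sorted(self.entries.items(),
                        key=lambda kv: cosine(q, kv[1][0]), reverse=True)
        return [(key, info) for key, (_, info) in ranked[:top_k]]

def generate(query, retrieved):
    """(c) Generate: stand-in for the MLLM; in RAP the query plus the
    retrieved concept information would be fed to the model."""
    context = "; ".join(f"{key}: {info}" for key, info in retrieved)
    return f"[context: {context}] answer to: {query}"

db = ConceptDatabase()
db.remember("Biscuit", "Biscuit is my brown dog",
            {"type": "dog", "color": "brown"})
query = "What color is Biscuit?"
response = generate(query, db.retrieve(query))
print(response)
```

Because personalization lives entirely in the external database, adding, editing, or deleting a concept takes effect on the next query with no model finetuning, which is the property the abstract contrasts with prior methods.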