Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
October 17, 2024
Authors: Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
cs.AI
Abstract
The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in humans' daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a data-collection pipeline and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://github.com/Hoar012/RAP-MLLM.
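The Remember/Retrieve/Generate loop described in the abstract can be sketched in a few lines. The embedder, the similarity measure, the database entries, and the `generate` stub below are all hypothetical placeholders chosen for illustration, not the paper's actual multimodal retriever or MLLM; the sketch only shows how a personalized response is assembled from an editable external key-value store.

```python
import math

def embed(text):
    # Toy stand-in for a multimodal embedder: a character-frequency
    # vector over a-z. The real system would embed images and text.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class ConceptDatabase:
    """(a) Remember: key-value store of user concepts.
    Editing self.entries at any time realizes real-time concept editing."""
    def __init__(self):
        self.entries = {}  # key -> (embedding, info)

    def remember(self, key, description, info):
        self.entries[key] = (embed(description), info)

    def retrieve(self, query, top_k=1):
        """(b) Retrieve: rank stored concepts by similarity to the query."""
        q = embed(query)
        ranked = sorted(self.entries.items(),
                        key=lambda kv: cosine(q, kv[1][0]), reverse=True)
        return [(key, info) for key, (_, info) in ranked[:top_k]]

def generate(query, retrieved):
    """(c) Generate: stand-in for the MLLM; in RAP the query plus the
    retrieved concept information would be fed to the model."""
    context = "; ".join(f"{key}: {info}" for key, info in retrieved)
    return f"[context: {context}] answer to: {query}"

db = ConceptDatabase()
db.remember("Biscuit", "Biscuit is my brown dog",
            {"type": "dog", "color": "brown"})
query = "What color is Biscuit?"
response = generate(query, db.retrieve(query))
print(response)
```

Because personalization lives entirely in the external database, adding, editing, or deleting a concept takes effect on the next query with no model finetuning, which is the property the abstract contrasts with prior methods.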