PersonaVLM: Long-Term Personalized Multimodal LLMs

March 20, 2026
Authors: Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time (see Fig. 1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.
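The abstract describes the three capabilities only at a high level. As a rough illustration of how they might fit together, here is a toy sketch. It is not PersonaVLM's implementation, and every name in it (`MemoryRecord`, `PersonalMemoryDB`, `remember`, `retrieve`, `dominant_trait`, the `decay` value) is hypothetical. It assumes memories are stored as short text summaries tagged with a turn index. It also uses naive keyword overlap for retrieval, where a real system would more likely use learned embeddings. Finally, it models "evolving personality" as trait scores that fade by an assumed exponential decay each turn.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Set


@dataclass
class MemoryRecord:
    turn: int      # position of the interaction in the user's timeline
    modality: str  # hypothetical label, e.g. "text" or "image"
    summary: str   # short textual summary extracted from the interaction


def _tokens(text: str) -> Set[str]:
    # Naive lowercase word tokenizer (assumption; not the paper's retriever).
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class PersonalMemoryDB:
    records: List[MemoryRecord] = field(default_factory=list)
    traits: Dict[str, float] = field(default_factory=dict)
    decay: float = 0.8  # hypothetical: weight older trait evidence keeps per turn

    def remember(self, turn: int, modality: str, summary: str,
                 observed_traits: Sequence[str] = ()) -> None:
        """Remembering: store a chronological memory and update trait scores."""
        self.records.append(MemoryRecord(turn, modality, summary))
        for trait in self.traits:          # fade older evidence
            self.traits[trait] *= self.decay
        for trait in observed_traits:      # reinforce traits observed this turn
            self.traits[trait] = self.traits.get(trait, 0.0) + 1.0

    def retrieve(self, query: str, k: int = 2) -> List[MemoryRecord]:
        """Reasoning support: top-k memories by keyword overlap, newer first on ties."""
        q = _tokens(query)
        scored = [(len(q & _tokens(r.summary)), r.turn, r) for r in self.records]
        scored = [s for s in scored if s[0] > 0]  # drop irrelevant memories
        scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
        return [r for _, _, r in scored[:k]]

    def dominant_trait(self) -> Optional[str]:
        """Response alignment: the trait with the highest current score."""
        return max(self.traits, key=self.traits.__getitem__) if self.traits else None


# Usage: the user's preference shifts from outdoor hiking to indoor climbing.
db = PersonalMemoryDB()
db.remember(1, "text", "user likes hiking in the mountains", ["outdoorsy"])
db.remember(2, "image", "photo of the user's cat sleeping", ["pet owner"])
db.remember(3, "text", "user now prefers indoor climbing gyms", ["indoorsy"])
db.remember(4, "text", "user booked another climbing gym session", ["indoorsy"])

print([r.turn for r in db.retrieve("climbing gym")])  # [4, 3]
print(db.dominant_trait())  # indoorsy
```

The decay factor is what lets the earlier "outdoorsy" evidence fade (1.0 down to about 0.512 after three later turns) once newer evidence accumulates. The abstract does not say how PersonaVLM itself infers or weights evolving traits, so this choice is purely illustrative.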