Contextualized Visual Personalization in Vision-Language Models
February 3, 2026
Authors: Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon
cs.AI
Abstract
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in a user's specific experiences, as they lack the ability to associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization, which requires a VLM to visually recognize and textually retrieve personalized visual experiences when interpreting new images. To address it, we propose CoViP, a unified framework that treats personalized image captioning as the core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments show that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial step toward robust and generalizable contextualized visual personalization.
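To make the caption-augmented generation idea concrete, below is a minimal sketch: given an embedding of the new image, it retrieves the user's most similar stored captions and prepends them to the prompt before querying a VLM. The `Memory` structure, the `vlm_generate` callable, the cosine-similarity retrieval, and the prompt format are all illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of caption-augmented generation. The Memory type,
# vlm_generate interface, retrieval scheme, and prompt format are
# assumptions for illustration only, not CoViP's actual method.

from dataclasses import dataclass
from typing import Callable, List, Sequence
import math


@dataclass
class Memory:
    """One entry of a user's accumulated visual-textual context."""
    embedding: Sequence[float]  # image embedding of a past photo (hypothetical)
    caption: str                # personalized caption stored alongside it


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def caption_augmented_generate(
    query_embedding: Sequence[float],
    question: str,
    memories: List[Memory],
    vlm_generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Retrieve the user's most relevant past captions and prepend them
    to the prompt so the VLM can ground its answer in that context."""
    ranked = sorted(
        memories,
        key=lambda m: cosine(query_embedding, m.embedding),
        reverse=True,
    )
    context = "\n".join(f"- {m.caption}" for m in ranked[:top_k])
    prompt = (
        "Past experiences of this user:\n"
        f"{context}\n\n"
        f"Using this context and the new image, answer: {question}"
    )
    return vlm_generate(prompt)
```

In this sketch the retrieval step stands in for the "visual recognition" half of the task (matching the new image against stored experiences), while the prompt augmentation stands in for the "textual retrieval" half; the abstract's diagnostic evaluations would then test whether the model's answer actually depends on the retrieved visual context rather than on textual shortcuts.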