RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
June 23, 2025
Authors: Yeongtak Oh, Jisoo Mok, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Sungroh Yoon
cs.AI
Abstract
Recent multi-modal large language models (MLLMs) often struggle to generate
personalized image captions, even when trained on high-quality captions. In
this work, we observe that such limitations persist in existing
post-training-based MLLM personalization methods. Specifically, despite being
post-tuned with large-scale caption data through supervised fine-tuning (SFT),
these models frequently fail to produce faithful descriptions in real-world
scenarios, such as multi-concept image captioning. However, acquiring
large-scale, high-quality captions for such complex settings is both costly and
difficult. To address the data-centric nature of SFT, we propose a
reinforcement learning (RL)-based post-training framework. To the best of our
knowledge, this is the first RL-based approach to post-train MLLMs for
personalized image captioning. Our method significantly enhances both visual
recognition and personalized generation capabilities of MLLMs, and consistently
outperforms existing SFT-based baselines, especially in the challenging
multi-concept image captioning task.
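The abstract does not specify the RL objective used for post-training. As a purely illustrative, toy-scale sketch of the general idea — policy-gradient post-training with a verifiable reward that checks whether a generated caption mentions the personalized concept — the following uses a plain REINFORCE update over a categorical policy standing in for the captioner; the candidate captions, the concept token `<bo>`, and the reward design are all hypothetical, not the paper's method:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy stand-in for the captioner: a categorical policy over candidate captions.
CANDIDATES = [
    "a dog on the beach",
    "<bo> the corgi playing on the beach",  # mentions the personal concept <bo>
    "a photo of an animal",
]

def reward(caption, concept="<bo>"):
    # Hypothetical verifiable reward: 1.0 if the personalized concept is named.
    return 1.0 if concept in caption else 0.0

def reinforce_step(logits, lr=1.0, n_samples=32, rng=random):
    """One REINFORCE update: sample captions, score them, and raise the
    log-probability of above-baseline samples (baseline = batch mean reward)."""
    probs = softmax(logits)
    idxs = rng.choices(range(len(CANDIDATES)), weights=probs, k=n_samples)
    rewards = [reward(CANDIDATES[i]) for i in idxs]
    baseline = sum(rewards) / len(rewards)
    grads = [0.0] * len(logits)
    for i, r in zip(idxs, rewards):
        adv = r - baseline
        for j in range(len(logits)):
            # d/d logit_j of log p(i) = 1[i == j] - p(j)
            grads[j] += adv * ((1.0 if j == i else 0.0) - probs[j])
    return [l + lr * g / n_samples for l, g in zip(logits, grads)]

random.seed(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits)
probs = softmax(logits)
# After training, probability mass concentrates on the concept-faithful caption.
```

In a real MLLM setting the categorical policy would be replaced by token-level sampling from the model, but the update rule (reward minus baseline times the score function) is the same shape.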