RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

Abstract

Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that this limitation persists in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned on large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address this data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-training MLLMs for personalized image captioning. Our method significantly enhances both the visual recognition and the personalized generation capabilities of MLLMs, and it consistently outperforms existing SFT-based baselines, especially on the challenging multi-concept image captioning task.
