CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

Abstract

Image captioning is a fundamental task that bridges the visual and linguistic domains, and it plays a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads models to memorize specific ground-truth answers, which limits their generality and their ability to generate diverse, creative descriptions. To overcome these limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. The primary challenge is designing an objective reward function when what constitutes a "good" caption is inherently subjective. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline in which an LVLM generates a caption, and the objective reward is the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL yields significant improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available at https://github.com/InternLM/CapRL.
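The reward described in the abstract can be illustrated with a minimal sketch: score a caption by the fraction of multiple-choice questions that a text-only answerer gets right when it sees only that caption. This is not the authors' implementation. The names `MCQ`, `Answerer`, `caption_reward`, and `keyword_answerer` are hypothetical, and the toy keyword matcher merely stands in for the vision-free LLM so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question about an image (hypothetical structure)."""
    question: str
    options: Sequence[str]  # answer choices shown to the text-only model
    answer_index: int       # index of the correct option


# Hypothetical interface for the vision-free LLM: it receives only the caption,
# the question, and the options, and returns the index of the option it picks.
Answerer = Callable[[str, str, Sequence[str]], int]


def caption_reward(caption: str, questions: Sequence[MCQ], answerer: Answerer) -> float:
    """Return the fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for q in questions
        if answerer(caption, q.question, q.options) == q.answer_index
    )
    return correct / len(questions)


def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for the LLM: pick the first option mentioned in the caption."""
    text = caption.lower()
    for i, option in enumerate(options):
        if option.lower() in text:
            return i
    return 0  # fall back to the first option when nothing matches


if __name__ == "__main__":
    questions = [
        MCQ("What animal is shown?", ["cat", "dog", "horse"], 1),
        MCQ("What color is the ball?", ["blue", "red", "green"], 1),
    ]
    detailed = "A brown dog chases a red ball across a lawn."
    vague = "An animal plays outside."
    print(caption_reward(detailed, questions, keyword_answerer))
    print(caption_reward(vague, questions, keyword_answerer))
```

In this toy setup the detailed caption scores higher than the vague one because it contains the facts the questions ask about. In an RLVR loop, a score of this kind would presumably serve as the scalar reward for each sampled caption. How the questions are built and filtered, and how the answering LLM is prompted, are not specified in the abstract.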
