CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

Abstract

Image captioning is a fundamental task that bridges the visual and linguistic domains and plays a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads models to memorize specific ground-truth answers, limiting their generality and their ability to generate diverse, creative descriptions. To overcome the limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function when what constitutes a "good" caption is inherently subjective. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline in which an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering multiple-choice questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL yields significant improvements across multiple settings. Pre-training on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B while exceeding the baseline by an average margin of 8.4%. Code is available at https://github.com/InternLM/CapRL.
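
The reward described in the abstract (the accuracy of a vision-free LLM answering multiple-choice questions from the caption alone) can be illustrated with a minimal sketch. This is not the released CapRL implementation: the `MCQ` type, the `caption_reward` function, and the `answer_fn` interface are hypothetical names introduced here, and `keyword_answerer` is a toy stand-in for the text-only LLM. The abstract does not specify how questions are produced or how answers are sampled, so those details are left out.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question about an image (hypothetical structure)."""
    question: str
    options: Sequence[str]
    answer_index: int  # index of the correct option


# Assumed interface for the vision-free LLM:
# (caption, question, options) -> index of the chosen option.
TextOnlyAnswerer = Callable[[str, str, Sequence[str]], int]


def caption_reward(
    caption: str,
    questions: Sequence[MCQ],
    answer_fn: TextOnlyAnswerer,
) -> float:
    """Return the fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0  # assumption: no questions means no reward signal
    correct = sum(
        1
        for q in questions
        if answer_fn(caption, q.question, q.options) == q.answer_index
    )
    return correct / len(questions)


def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for the LLM: choose the first option mentioned in the caption."""
    lowered = caption.lower()
    for i, option in enumerate(options):
        if option.lower() in lowered:
            return i
    return 0  # fall back to the first option when nothing matches


if __name__ == "__main__":
    qs = [
        MCQ("What colour is the car?", ("red", "blue"), 1),
        MCQ("What animal is shown?", ("cat", "dog"), 1),
    ]
    print(caption_reward("A blue car parked next to a dog.", qs, keyword_answerer))
    print(caption_reward("A car on a street.", qs, keyword_answerer))
```

In this sketch, a caption that mentions the details the questions ask about scores higher than a vague one, which is the property that makes the reward verifiable without a human judging the caption directly. In practice `answer_fn` would wrap a call to a text-only LLM rather than a keyword match.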
