BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Abstract

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, which induces trade-offs across the core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions of limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. To optimize the resulting continuous multi-objective reward formulation effectively, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. We also introduce length-conditional reward masking, which yields a length penalty better suited to captioning. Across LLaVA-1.5-7B and Qwen2.5-VL (3B and 7B) base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
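The abstract does not spell out the normalization itself, so the following is only a minimal sketch of the contrast it draws. It assumes that "reward-decoupled" means each reward component is standardized within a group of sampled captions before the components are summed, whereas the vanilla GRPO-style baseline sums the components first and standardizes once. The function names, the `weights` parameter, the `EPS` constant, and the toy reward values are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant that guards against a zero standard deviation


def group_normalize(values):
    """Standardize one list of scores within a group of sampled captions."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_then_normalized(rewards, weights):
    """Baseline sketch: weight and sum the components, then standardize once."""
    totals = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return group_normalize(totals)


def decoupled_normalized(rewards, weights):
    """Decoupled sketch: standardize each component across the group, then sum."""
    columns = [list(col) for col in zip(*rewards)]
    normed = [group_normalize(col) for col in columns]
    return [
        sum(w * col[i] for w, col in zip(weights, normed))
        for i in range(len(rewards))
    ]


if __name__ == "__main__":
    # Toy group of four captions; columns are hypothetical scores for
    # correctness, coverage, and linguistic quality (illustrative values only).
    rewards = [
        [0.9, 0.2, 0.90],
        [0.1, 0.8, 0.94],
        [0.5, 0.5, 0.92],
        [0.5, 0.5, 0.92],
    ]
    weights = [1.0, 1.0, 1.0]
    print(summed_then_normalized(rewards, weights))
    print(decoupled_normalized(rewards, weights))
```

In this toy group the third component varies over a much narrower range than the other two, so it barely influences the summed score. Standardizing each component separately gives it the same scale as the others, and the first two captions swap rank as a result. Whether the paper applies further steps after the per-component normalization is not stated in the abstract.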
