

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

September 26, 2025
Authors: Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
cs.AI

Abstract

Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and their ability to generate diverse, creative descriptions. To overcome the limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly improves performance across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B yields substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
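To make the utility-based reward concrete, below is a minimal sketch of the kind of verifiable reward the abstract describes: a vision-free LLM sees only the generated caption and must answer multiple-choice questions about the image, and the reward is its accuracy. The `MCQ` data class and the `answer_with_llm` callable are hypothetical placeholders, not part of the released CapRL code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQ:
    """A multiple-choice question about the image (hypothetical structure)."""
    question: str
    options: List[str]
    answer_index: int  # index of the correct option


def caprl_style_reward(caption: str,
                       mcqs: List[MCQ],
                       answer_with_llm: Callable[[str], int]) -> float:
    """Utility-based caption reward (sketch).

    The caption alone is given to a vision-free LLM, which answers each
    multiple-choice question; the reward is the fraction answered correctly.
    `answer_with_llm` is assumed to take a prompt and return an option index.
    """
    if not mcqs:
        return 0.0

    correct = 0
    for q in mcqs:
        prompt = (
            "Answer using only the image description below.\n"
            f"Description: {caption}\n"
            f"Question: {q.question}\n"
            + "\n".join(f"{i}. {opt}" for i, opt in enumerate(q.options))
            + "\nReply with the option number only."
        )
        if answer_with_llm(prompt) == q.answer_index:
            correct += 1

    return correct / len(mcqs)
```

Because the reward is a plain answer-accuracy check rather than a learned or human-judged score, it stays objective and verifiable even though caption quality itself is subjective, which is what makes the RLVR framing applicable here.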