
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

September 26, 2025
Authors: Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
cs.AI

Abstract

Image captioning is a fundamental task that bridges the visual and linguistic domains and plays a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and their ability to generate diverse, creative descriptions. To overcome the limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective notion of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline in which an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering multiple-choice questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL substantially improves performance across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B yields substantial gains across 12 benchmarks. Moreover, within the Prism framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
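
The reward design described in the abstract can be summarized in a short sketch. The snippet below is an illustrative reconstruction, not the authors' implementation: the `MCQ` data structure, the `caption_reward` function, and the `answer_with_llm` callable (a stand-in for any vision-free LLM interface) are assumptions introduced here for clarity.

```python
# Illustrative sketch of a CapRL-style verifiable reward: score a caption by
# how accurately a vision-free LLM answers multiple-choice questions about
# the image when given only that caption as context.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQ:
    question: str
    options: List[str]   # e.g. ["A) a cat", "B) a dog", "C) a bird", "D) a fish"]
    answer: str          # ground-truth option letter, e.g. "B"


def caption_reward(
    caption: str,
    mcqs: List[MCQ],
    answer_with_llm: Callable[[str], str],  # hypothetical vision-free LLM call
) -> float:
    """Return the MCQ accuracy achieved using only the caption as evidence."""
    correct = 0
    for q in mcqs:
        prompt = (
            f"Image description:\n{caption}\n\n"
            f"Question: {q.question}\n"
            + "\n".join(q.options)
            + "\nAnswer with a single option letter."
        )
        prediction = answer_with_llm(prompt).strip().upper()[:1]
        correct += int(prediction == q.answer)
    # Accuracy over the question set serves as the scalar RL reward.
    return correct / len(mcqs) if mcqs else 0.0
```

Because the reward is an answer accuracy over verifiable multiple-choice items rather than a judgment of caption style, it is objective and can be optimized with standard RLVR methods even though caption quality itself is subjective.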