
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

May 8, 2026
作者: Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch
cs.AI

Abstract

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
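The abstract describes two training-side mechanisms: GDPO-style reward-decoupled normalization (normalize each continuous reward component across the rollout group before aggregating, rather than normalizing the summed scalar as vanilla GRPO does) and length-conditional reward masking. The paper's exact formulas are not given here, so the sketch below is only an illustration of that general idea; the function names, the masking form, and the zero-out-over-budget behavior are assumptions, not the authors' implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Vanilla GRPO baseline: sum the reward components per rollout,
    then normalize the scalar total across the group of G rollouts.

    rewards: (G, K) array, one row per rollout, one column per objective
    (e.g. utility correctness, reference coverage, linguistic quality).
    """
    total = rewards.sum(axis=1)                                  # (G,)
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantages(rewards: np.ndarray) -> np.ndarray:
    """GDPO-style reward-decoupled normalization (sketch): normalize each
    reward dimension across the group *before* aggregating, so a single
    high-variance objective cannot dominate the combined advantage."""
    mu = rewards.mean(axis=0, keepdims=True)                     # (1, K)
    sigma = rewards.std(axis=0, keepdims=True) + 1e-8            # (1, K)
    return ((rewards - mu) / sigma).sum(axis=1)                  # (G,)

def length_masked_rewards(rewards: np.ndarray,
                          lengths: np.ndarray,
                          max_len: int) -> np.ndarray:
    """Hypothetical length-conditional masking: zero the rewards of
    rollouts whose caption exceeds a token budget, instead of applying
    a smooth length penalty that also pushes short captions."""
    mask = (lengths <= max_len).astype(rewards.dtype)            # (G,)
    return rewards * mask[:, None]
```

Under this formulation, a caption that scores very high on one noisy objective (e.g. downstream QA utility) no longer swamps the fluency and coverage signals, which is one way the trade-offs described above could be mitigated.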