BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
May 8, 2026
Authors: Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch
cs.AI
Abstract
Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
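To make the contrast between vanilla GRPO normalization and the GDPO-style reward-decoupled normalization mentioned above concrete, here is a minimal sketch in NumPy. It is not the paper's implementation: the reward component names (correctness, coverage, linguistic quality), the equal-weight combination, and the simple length-budget masking rule are all assumptions made for illustration.

```python
import numpy as np

def grpo_advantages(rewards):
    """Vanilla GRPO-style baseline (illustrative): sum the reward components
    per rollout, then normalize the scalar reward across the sampled group."""
    total = rewards.sum(axis=1)                      # (G,) summed reward per rollout
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantages(rewards, weights=None):
    """GDPO-style reward-decoupled normalization (illustrative): normalize each
    continuous reward component across the group separately, then combine."""
    G, K = rewards.shape
    weights = np.ones(K) / K if weights is None else np.asarray(weights)
    per_component = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return per_component @ weights                   # (G,) combined advantage

def length_masked_rewards(rewards, lengths, max_len):
    """Hypothetical length-conditional masking: zero out rewards for rollouts
    whose caption length exceeds a budget (a stand-in for the paper's penalty)."""
    mask = (lengths <= max_len).astype(float)        # (G,)
    return rewards * mask[:, None]

# Toy example: a group of 4 rollouts, each with 3 continuous reward components
# (e.g. correctness, coverage, linguistic quality -- names are assumptions).
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 1.0, size=(4, 3))
lengths = np.array([120, 310, 95, 240])

masked = length_masked_rewards(rewards, lengths, max_len=256)
print("GRPO advantages:      ", grpo_advantages(masked))
print("Decoupled advantages: ", decoupled_advantages(masked))
```

The intended distinction: GRPO collapses the multi-objective reward to a scalar before group normalization, so a high-variance component can dominate the advantage, whereas the decoupled variant standardizes each component within the group before combining them.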