BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Abstract

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, which induces trade-offs among the core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions of limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. To optimize the resulting continuous multi-objective reward formulation effectively, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. We also introduce length-conditional reward masking, which yields a length penalty better suited to captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena, depending on the model.
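The contrast between vanilla GRPO and reward-decoupled normalization can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the epsilon, and the toy reward values are assumptions made for illustration, and the sketch shows only the idea of standardizing each reward component within a group before summing. The full GDPO recipe and the length-conditional masking may involve steps that are omitted here.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def _standardize(values):
    """Zero-mean, unit-variance scaling within one group of sampled captions."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards):
    """GRPO-style baseline: sum the reward components, then standardize the totals."""
    totals = [sum(r) for r in rewards]
    return _standardize(totals)


def decoupled_advantages(rewards):
    """Decoupled variant: standardize each component separately, then sum."""
    n_components = len(rewards[0])
    columns = [
        _standardize([r[k] for r in rewards]) for k in range(n_components)
    ]
    return [sum(col[i] for col in columns) for i in range(len(rewards))]


# Hypothetical group of four captions for one image, scored on
# (correctness, coverage, linguistic quality). Values are made up.
group = [
    (0.9, 0.5, 0.80),
    (0.1, 0.5, 0.90),
    (0.5, 0.5, 0.85),
    (0.5, 0.5, 0.85),
]

summed = grpo_advantages(group)
decoupled = decoupled_advantages(group)
```

In this toy group, correctness varies far more than linguistic quality, so summing before normalizing lets correctness dominate the advantage of the first caption. Standardizing each component first puts the two signals on the same scale, and the first caption's correctness lead is offset by its lower linguistic score.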
