RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

March 10, 2026
Authors: Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu
cs.AI

Abstract

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
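The reward pipeline the abstract describes — a committee of candidate captions, an LLM rubric writer that turns consensus strengths and policy deficiencies into explicit criteria, and an LLM judge that scores each criterion instead of emitting one coarse scalar — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`write_rubric`, `judge_score`, `rubric_reward`), the fixed criteria, and the toy scoring heuristic standing in for the LLM calls are all assumptions.

```python
# Hypothetical sketch of rubric-guided reward computation, in the spirit
# of RubiCap. All names and heuristics here are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float

def write_rubric(committee):
    """Stand-in for the LLM rubric writer: in the paper, an LLM inspects
    a diverse committee of candidate captions, extracts consensus
    strengths, and diagnoses deficiencies in the current policy. Here we
    simply return a fixed, weighted set of criteria."""
    return [
        Criterion("object coverage", 0.4),
        Criterion("attribute accuracy", 0.3),
        Criterion("spatial relations", 0.3),
    ]

def judge_score(caption, criterion):
    """Stand-in for the LLM judge: scores one caption on one criterion
    in [0, 1]. Replaced here by a trivial length-based heuristic so the
    sketch is runnable without a model."""
    return min(len(caption.split()) / 50.0, 1.0)

def rubric_reward(caption, rubric):
    """Aggregate per-criterion judgments into a structured reward,
    replacing a single coarse scalar score with a weighted decomposition."""
    return sum(c.weight * judge_score(caption, c) for c in rubric)

committee = [
    "A dog on a red couch.",
    "A brown dog lies on a red couch near a sunlit window.",
]
rubric = write_rubric(committee)
reward = rubric_reward(committee[1], rubric)
```

The key design point the abstract emphasizes is that the rubric is sample-specific: it is rewritten per image from the candidate committee, so the reward signal tracks what the current policy actually gets wrong rather than a static, generic quality scale.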