RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
March 10, 2026
Authors: Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu
cs.AI
Abstract
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
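The abstract's reward pipeline (a committee of candidate captions, a rubric writer that extracts consensus strengths, and a judge that scores the rubric criterion by criterion) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: both LLM calls are stubbed with keyword heuristics, and all function and class names (`write_rubric`, `judge`, `Criterion`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One rubric item; `keyword` is a toy stand-in for the judge's check."""
    description: str
    keyword: str

def write_rubric(committee_captions: list[str]) -> list[Criterion]:
    """Stub rubric writer: promote words shared by every committee caption
    into criteria, a crude proxy for extracting 'consensus strengths'."""
    shared = set.intersection(*(set(c.lower().split()) for c in committee_captions))
    return [Criterion(f"mentions '{w}'", w) for w in sorted(shared)]

def judge(caption: str, rubric: list[Criterion]) -> float:
    """Stub judge: the reward is the fraction of rubric criteria the caption
    satisfies, i.e. a structured multi-faceted score in [0, 1] rather than a
    single holistic scalar."""
    if not rubric:
        return 0.0
    hits = sum(c.keyword in caption.lower() for c in rubric)
    return hits / len(rubric)

committee = [
    "a red bicycle leaning against a brick wall",
    "a red bicycle parked by a brick wall at dusk",
]
rubric = write_rubric(committee)
print(judge("A red bicycle stands near a brick wall.", rubric))  # all criteria hit
print(judge("a blue car on the street", rubric))                 # few criteria hit
```

In the actual framework the rubric is sample-specific and written by an LLM that also diagnoses the current policy's deficiencies, and the per-criterion scores would feed an RL objective as the reward signal; the sketch only shows the decomposed-reward shape.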