RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Abstract

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
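
The rubric-based reward described above can be illustrated with a minimal sketch. The abstract does not specify how rubric judgments are aggregated into a reward, so everything here is an assumption made for illustration: the `Criterion` class, its `weight` field, the weighted-fraction aggregation in `rubric_reward`, and the keyword-matching `keyword_judge`, which is only a toy stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item. `weight` is a hypothetical importance score."""
    description: str
    weight: float = 1.0


def rubric_reward(
    caption: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], bool],
) -> float:
    """Return the weighted fraction of rubric criteria the judge marks as satisfied.

    This aggregation rule is an assumption, not the method from the paper.
    """
    total = sum(c.weight for c in rubric)
    if total <= 0:
        return 0.0
    earned = sum(c.weight for c in rubric if judge(caption, c))
    return earned / total


def keyword_judge(caption: str, criterion: Criterion) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return criterion.description.lower() in caption.lower()


# Sample-specific rubric for one hypothetical image.
rubric = [
    Criterion("red umbrella", weight=2.0),
    Criterion("wet street", weight=1.0),
    Criterion("neon sign", weight=1.0),
]
reward = rubric_reward(
    "A woman holds a red umbrella on a wet street.", rubric, keyword_judge
)
```

In this sketch a caption earns partial credit for each criterion it satisfies, which is one way a single coarse scalar score could be replaced by a structured, per-criterion evaluation. In an RL setup, `judge` would presumably be replaced by a call to an LLM judge and the rubric would be written per sample.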
