ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning via Visual Claim Comparison

Abstract

Long-form image captioning exposes a reward granularity problem in reinforcement learning (RL): captions are judged as whole sequences, while the errors that matter occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, whereas ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
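
To make the idea of a claim-level reward unit concrete, the following is a minimal illustrative sketch of how per-difference statistics might be composed into a scalar reward with separate hallucination and omission terms. It is not the paper's reward formula, which the abstract does not specify. The names `ClaimDiff` and `compose_reward`, the integer severity scale, and the weights `w_halluc` and `w_omit` are all hypothetical, and the judge's output is assumed to be already parsed into records.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ClaimDiff:
    """One judged difference between the actor and reference captions (hypothetical schema)."""
    kind: str        # "hallucination" (actor claims something false) or "omission" (salient fact missing)
    error_type: str  # open-vocabulary label, e.g. "object_count"
    severity: int    # assumed integer scale; larger means worse
    verified: bool   # True if the judge confirmed the difference against the image


def compose_reward(diffs: Iterable[ClaimDiff],
                   w_halluc: float = 1.0,
                   w_omit: float = 1.0) -> dict:
    """Sum severities of verified differences per kind and combine them with separate weights.

    The linear weighting is an assumption for illustration only.
    """
    halluc = 0
    omit = 0
    for d in diffs:
        if not d.verified:
            continue  # ignore differences the judge could not confirm
        if d.kind == "hallucination":
            halluc += d.severity
        elif d.kind == "omission":
            omit += d.severity
    return {
        "hallucination_penalty": halluc,
        "omission_penalty": omit,
        "reward": -(w_halluc * halluc + w_omit * omit),
    }


example_diffs = [
    ClaimDiff("hallucination", "object_count", 2, True),
    ClaimDiff("omission", "scene", 1, True),
    ClaimDiff("hallucination", "color", 3, False),  # unverified, so excluded
]
balanced = compose_reward(example_diffs)
coverage_lenient = compose_reward(example_diffs, w_omit=0.5)
```

Keeping the two penalties as separate terms is what would allow an operating point to be shifted toward faithfulness or toward coverage by changing one weight, rather than relying on a single holistic score.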