ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning via Visual Claim Comparison

Abstract

Long-form image captioning exposes a reward granularity problem in reinforcement learning (RL): captions are judged as whole sequences, while the errors that matter occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, whereas ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
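
As an illustration of how per-difference statistics might be composed into a reward, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the composition rule, so the `ClaimDiff` record, the 1-to-3 severity scale, the weights `w_halluc` and `w_missing`, and the severity-weighted penalty are all assumptions made for this example. It shows only the property the abstract emphasizes: hallucinated claims and missing facts are tallied separately, so each can be weighted on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimDiff:
    """One judge-produced difference between actor and reference caption (hypothetical schema)."""
    kind: str        # "hallucination" (actor claim not in image) or "missing" (omitted fact)
    error_type: str  # open-vocabulary label, e.g. "object count"
    severity: int    # assumed scale: 1 (minor) to 3 (severe)
    verified: bool   # True if the judge confirmed the difference against the image


def compose_reward(diffs, w_halluc=1.0, w_missing=1.0):
    """Return a negative penalty: severity sums per kind, weighted independently.

    Differences the judge could not verify against the image are ignored.
    """
    halluc = sum(d.severity for d in diffs if d.verified and d.kind == "hallucination")
    missing = sum(d.severity for d in diffs if d.verified and d.kind == "missing")
    return -(w_halluc * halluc + w_missing * missing)


diffs = [
    ClaimDiff("hallucination", "object count", 2, True),
    ClaimDiff("missing", "spatial relation", 1, True),
    ClaimDiff("hallucination", "color", 3, False),  # not verified, so not penalized
]

print(compose_reward(diffs))                # -3.0
print(compose_reward(diffs, w_halluc=2.0))  # -5.0
```

Raising `w_halluc` relative to `w_missing` in this sketch penalizes unfaithful claims more heavily, while raising `w_missing` penalizes omissions; a holistic scalar reward offers no such separate control.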