ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning via Visual Claim Comparison

Abstract

Long-form image captioning exposes a reward granularity problem in reinforcement learning (RL): captions are judged as whole sequences, while the errors that matter occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates the visually relevant differences between the two captions, verifies each difference against the image, assigns an open-vocabulary error type and a severity level, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, whereas ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
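
To make the idea of composing a reward from per-difference statistics concrete, the following is a minimal sketch and not the paper's implementation. The record fields (`kind`, `error_type`, `severity`, `verified`), the integer severity scale, the linear penalty, and the weights `w_halluc` and `w_omit` are all assumptions introduced here for illustration; the abstract does not specify how the judge's outputs are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimDiff:
    """One atomic difference reported by the judge (hypothetical schema)."""
    claim: str        # text of the differing visual claim
    kind: str         # "hallucination" (actor-only, wrong) or "omission" (reference-only, missed)
    error_type: str   # open-vocabulary label, e.g. "object count"
    severity: int     # assumed integer scale, larger means worse
    verified: bool    # True if the judge confirmed the difference against the image


def compose_reward(diffs: List[ClaimDiff],
                   w_halluc: float = 1.0,
                   w_omit: float = 1.0) -> float:
    """Combine verified differences into one scalar with separate weights.

    Assumption: the reward is a negative weighted sum of severities, so that
    hallucinations and omissions can be penalized independently.
    """
    halluc = sum(d.severity for d in diffs
                 if d.verified and d.kind == "hallucination")
    omit = sum(d.severity for d in diffs
               if d.verified and d.kind == "omission")
    return -(w_halluc * halluc + w_omit * omit)


# Made-up example records, not data from the paper.
example_diffs = [
    ClaimDiff("three dogs on the lawn", "hallucination", "object count", 2, True),
    ClaimDiff("red umbrella on the left", "omission", "salient object", 1, True),
    ClaimDiff("sign reads 'OPEN'", "hallucination", "text", 3, False),  # not confirmed, ignored
]

balanced = compose_reward(example_diffs)
coverage_lenient = compose_reward(example_diffs, w_omit=0.5)
```

Keeping the two weights separate is what would let a practitioner move along the faithfulness-coverage tradeoff, rather than having it fixed implicitly by a single holistic score.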