ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning via Visual Claim Comparison

Abstract

Long-form image captioning exposes a reward granularity problem in reinforcement learning (RL): captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, while ClaimDiff-RL exposes this tradeoff between faithfulness and coverage and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
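
To make the idea of composing a reward from per-difference statistics concrete, the following is a minimal sketch, not the paper's actual reward. The abstract does not specify the severity scale, the weights, or the functional form, so the names `ClaimDiff`, `SEVERITY_WEIGHT`, `penalty_stats`, and `compose_reward`, the three severity levels, and the linear weighted penalty are all assumptions made for illustration. The sketch only shows how keeping hallucinations and omissions as separate terms makes each one independently tunable.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Hypothetical severity scale; the source only says "severity levels".
SEVERITY_WEIGHT = {"minor": 0.5, "major": 1.0, "critical": 2.0}


@dataclass(frozen=True)
class ClaimDiff:
    """One judged difference between the actor caption and the reference."""
    kind: str        # "hallucination" or "omission" (assumed two-way split)
    error_type: str  # open-vocabulary label from the judge, e.g. "object count"
    severity: str    # key into SEVERITY_WEIGHT
    verified: bool   # True if the judge confirmed the difference against the image


def penalty_stats(diffs: Iterable[ClaimDiff]) -> Tuple[float, float]:
    """Return severity-weighted (hallucination, omission) totals.

    Differences the judge could not verify against the image are ignored.
    """
    halluc = 0.0
    omit = 0.0
    for d in diffs:
        if not d.verified:
            continue
        weight = SEVERITY_WEIGHT[d.severity]
        if d.kind == "hallucination":
            halluc += weight
        elif d.kind == "omission":
            omit += weight
    return halluc, omit


def compose_reward(diffs: Iterable[ClaimDiff],
                   w_halluc: float = 1.0,
                   w_omit: float = 1.0) -> float:
    """Assumed linear composition: negative weighted sum of the two penalties."""
    halluc, omit = penalty_stats(diffs)
    return -(w_halluc * halluc + w_omit * omit)


example = [
    ClaimDiff("hallucination", "object count", "major", True),
    ClaimDiff("omission", "spatial relation", "minor", True),
    ClaimDiff("hallucination", "scene type", "critical", False),  # unverified, ignored
]
```

Under these assumptions, raising `w_omit` relative to `w_halluc` penalizes omitted salient facts more heavily, which is one way to move along the tradeoff between faithfulness and coverage instead of collapsing it into a single holistic score.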