RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Abstract

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers, a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
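The abstract says that rubric criteria let a judge replace one coarse scalar score with a structured, per-criterion evaluation, but it does not say how the per-criterion verdicts are turned into a reward. The sketch below shows one plausible aggregation, a weighted fraction of satisfied criteria. It is an illustration only and not the authors' method: the names Criterion, rubric_reward, and keyword_judge, the weight field, and the pass/fail verdict format are all assumptions, and the keyword matcher is a toy stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric item. The weight field is an illustrative assumption."""
    description: str
    weight: float = 1.0


def rubric_reward(
    caption: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], bool],
) -> float:
    """Return the weighted fraction of criteria the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        return 0.0  # an empty or zero-weight rubric carries no signal
    earned = sum(c.weight for c in rubric if judge(caption, c))
    return earned / total


def keyword_judge(caption: str, criterion: Criterion) -> bool:
    """Toy stand-in for an LLM judge: passes if the phrase appears verbatim."""
    return criterion.description.lower() in caption.lower()


# Hypothetical sample-specific rubric for a single image.
rubric = [
    Criterion("red umbrella", weight=2.0),
    Criterion("wet pavement"),
    Criterion("neon sign"),
]
caption = "A woman holds a red umbrella over wet pavement."
reward = rubric_reward(caption, rubric, keyword_judge)
```

In an RL loop, a per-sample value like this would serve as the reward for each caption sampled from the policy. Swapping keyword_judge for a call to a real LLM judge leaves the aggregation unchanged.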
