ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Abstract

Reinforcement learning (RL) has proven highly effective for fine-tuning large language models (LLMs) on tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
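The binary, exact-match reward described in the abstract can be illustrated with a minimal sketch. The helper names (`normalize`, `exact_match_reward`, `inject_error`), the toy caption, and the case and whitespace normalization are all assumptions made for illustration; the abstract does not specify how spans are compared or how errors are injected.

```python
import re


def normalize(span: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", span.strip().lower())


def exact_match_reward(predicted_span: str, injected_span: str) -> float:
    """Return 1.0 if the predicted span equals the injected span, else 0.0."""
    return 1.0 if normalize(predicted_span) == normalize(injected_span) else 0.0


def inject_error(caption: str, original: str, replacement: str) -> str:
    """Swap one phrase in the caption (hypothetical helper).

    The phrase must occur exactly once so the corrupted span is unambiguous.
    """
    if caption.count(original) != 1:
        raise ValueError("original phrase must occur exactly once")
    return caption.replace(original, replacement)


# Toy example: a count error is injected into a short caption.
caption = "Two brown dogs sit to the left of a red bench."
corrupted = inject_error(caption, "Two brown dogs", "Three brown dogs")

# A correct localization earns the full reward; anything else earns nothing.
reward_hit = exact_match_reward("three brown dogs", "Three brown dogs")
reward_miss = exact_match_reward("a red bench", "Three brown dogs")
```

Because the reward depends only on string equality with the known injected span, it needs no learned judge, which is what makes the task cheap to verify.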
