Self-Rewarding Vision-Language Model via Reasoning Decomposition

Abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, in which they describe things that are not actually in the image, and from language shortcuts, in which they skip the visual part and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that uses reinforcement learning to improve visual reasoning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and the result is used to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
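The two-stage reward described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the callables `rollout` and `text_only_answer`, the exact-match verifier, the additive combination, and the `perception_weight` parameter are all hypothetical names and assumptions introduced here, since the abstract does not specify how the rewards are computed or weighted.

```python
from typing import Callable, Dict, Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Assumed verifier: case- and whitespace-insensitive exact match."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def decomposed_reward(
    rollout: Callable[[object, str], Tuple[str, str]],
    text_only_answer: Callable[[str, str], str],
    image: object,
    question: str,
    reference: str,
    perception_weight: float = 1.0,  # assumption: not given in the abstract
) -> Dict[str, float]:
    """Combine a final-answer reward with a perception self-reward.

    rollout(image, question) -> (perception, final_answer): hypothetical
        first pass in which the model sees the image.
    text_only_answer(question, perception) -> answer: hypothetical second
        pass of the same model, given the perception text but no image.
    """
    perception, final_answer = rollout(image, question)

    # Self-containment check: can the question be answered from the
    # generated perception alone?
    blind_answer = text_only_answer(question, perception)

    answer_reward = exact_match(final_answer, reference)
    perception_reward = exact_match(blind_answer, reference)
    total = answer_reward + perception_weight * perception_reward
    return {
        "answer_reward": answer_reward,
        "perception_reward": perception_reward,
        "total": total,
    }


# Stub "models" for illustration only; a real setup would call the VLM.
def stub_rollout(image: object, question: str) -> Tuple[str, str]:
    return ("A red stop sign on a metal pole.", "red")


def stub_text_only_answer(question: str, perception: str) -> str:
    return "red" if "red" in perception.lower() else "unknown"


if __name__ == "__main__":
    print(decomposed_reward(stub_rollout, stub_text_only_answer,
                            image=None,
                            question="What color is the sign?",
                            reference="Red"))
```

Under these assumptions, a rollout that reaches the right final answer with a perception too thin to support it (a language shortcut) earns only the answer reward, whereas a self-contained perception earns both terms.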
