

Self-Rewarding Vision-Language Model via Reasoning Decomposition

August 27, 2025
作者: Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
cs.AI

Abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, describing things that are not actually in the image, and language shortcuts, where they skip the visual input and rely on text priors alone. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that uses reinforcement learning to improve visual reasoning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and its answer is used to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
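The two-stage reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `vlm` callable, the prompt templates, the exact-match scoring, and the 0.5/0.5 weighting are all assumptions made for the sketch, and the toy model stands in for a real VLM.

```python
def vision_sr1_reward(vlm, image, question, gold_answer):
    """Illustrative sketch of a Vision-SR1-style self-reward.

    `vlm(prompt, image=None)` is a hypothetical callable wrapping a
    vision-language model; with image=None it runs text-only.
    """
    # Stage 1: prompt the VLM to emit a self-contained visual perception,
    # i.e. a description sufficient to answer the question on its own.
    perception = vlm(
        f"Describe everything in the image needed to answer: {question}",
        image=image,
    )

    # Stage 2: re-prompt the SAME model with text only (no image), so any
    # correct answer must come entirely from the generated perception.
    blind_answer = vlm(
        f"Perception: {perception}\nQuestion: {question}\nAnswer:",
        image=None,
    )
    # Self-reward: 1 if the perception alone suffices to recover the answer.
    perception_reward = 1.0 if blind_answer.strip() == gold_answer else 0.0

    # Final-output reward: standard verifiable answer matching on a full pass.
    full_answer = vlm(f"Question: {question}\nAnswer:", image=image)
    answer_reward = 1.0 if full_answer.strip() == gold_answer else 0.0

    # Combine into a balanced training signal (weights here are illustrative).
    return 0.5 * perception_reward + 0.5 * answer_reward


# Toy stand-in for a VLM: answers from the perception text when no image
# is supplied, so we can exercise the reward function end to end.
def toy_vlm(prompt, image=None):
    if image is not None and "Describe" in prompt:
        return "a red cube on a table"
    if image is None:  # text-only pass: answer only if the perception mentions it
        return "red" if "red" in prompt else "unknown"
    return "red"  # full image-conditioned pass


reward = vision_sr1_reward(
    toy_vlm, image="fake-image", question="What color is the cube?",
    gold_answer="red",
)
print(reward)  # → 1.0: the perception alone recovered the answer
```

In an RL loop this scalar would be the per-rollout reward for the policy update; the key design point is that the verifier is the policy model itself, so the visual-perception signal adapts as the policy evolves instead of drifting from it.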