Self-Rewarding Vision-Language Model via Reasoning Decomposition
August 27, 2025
Authors: Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
cs.AI
Abstract
Vision-Language Models (VLMs) often suffer from visual hallucinations, describing
things that are not actually in the image, and language shortcuts, where they
skip the visual evidence and rely solely on text priors. These issues arise because
most post-training methods for VLMs rely on simple verifiable answer matching
and supervise only final outputs, leaving intermediate visual reasoning without
explicit guidance. As a result, VLMs receive sparse visual signals and often
learn to prioritize language-based reasoning over visual perception. To
mitigate this, some existing methods add visual supervision using human
annotations or distilled labels from external large models. However, human
annotations are labor-intensive and costly, and because external signals cannot
adapt to the evolving policy, they cause distributional shifts that can lead to
reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method
that improves visual reasoning via reinforcement learning without relying on
external visual supervision. Vision-SR1 decomposes VLM reasoning into two
stages: visual perception and language reasoning. The model is first prompted
to produce self-contained visual perceptions that are sufficient to answer the
question without referring back to the input image. To validate this
self-containment, the same VLM is then re-prompted to perform language
reasoning using only the generated perception as input to compute a reward. This
self-reward is combined with supervision on final outputs, providing a balanced
training signal that strengthens both visual perception and language reasoning.
Our experiments demonstrate that Vision-SR1 improves visual reasoning,
mitigates visual hallucinations, and reduces reliance on language shortcuts
across diverse vision-language tasks.
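The two-pass reward described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `generate` interface, the prompt templates, the exact-match reward, and the `alpha` weighting between the self-reward and the final-answer reward are all assumptions made here for clarity.

```python
# Hypothetical sketch of Vision-SR1's self-reward computation.
# `vlm` is assumed to be any object exposing generate(image=None, text=...) -> str;
# the real method plugs this reward into a reinforcement-learning loop.

def answers_match(pred, gold):
    """Simple verifiable answer matching on normalized strings."""
    return pred.strip().lower() == gold.strip().lower()

def vision_sr1_reward(vlm, image, question, gold_answer, alpha=0.5):
    """Combine the perception-only self-reward with final-output supervision.

    alpha balances the two signals (an illustrative choice, not from the paper).
    """
    # Stage 1: prompt the model to produce a self-contained visual perception,
    # then answer the question using both the image and that perception.
    perception = vlm.generate(
        image=image,
        text=f"Describe everything in the image needed to answer: {question}")
    final_answer = vlm.generate(
        image=image,
        text=f"{question}\nPerception: {perception}\nAnswer:")

    # Stage 2: re-prompt the SAME model with only the generated perception
    # (no image) to check that the perception alone suffices.
    text_only_answer = vlm.generate(
        image=None,
        text=f"{question}\nPerception: {perception}\nAnswer:")

    perception_reward = 1.0 if answers_match(text_only_answer, gold_answer) else 0.0
    final_reward = 1.0 if answers_match(final_answer, gold_answer) else 0.0
    return alpha * perception_reward + (1 - alpha) * final_reward

class StubVLM:
    """Trivial stand-in model used only to demonstrate the reward plumbing."""
    def generate(self, image=None, text=""):
        return "cat"

reward = vision_sr1_reward(StubVLM(), image="pixels",
                           question="What animal is shown?", gold_answer="cat")
print(reward)  # 1.0: both the full pass and the perception-only pass match
```

If the perception-only pass fails while the full pass succeeds, the reward drops, penalizing perceptions that are not self-contained; because the checking model is the evolving policy itself, the signal adapts with training instead of drifting like an external verifier.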