Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
September 15, 2025
Authors: Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang
cs.AI
Abstract
Recent advances in text-only "slow-thinking" reasoning have prompted efforts
to transfer this capability to vision-language models (VLMs), with the goal of
training visual reasoning models (VRMs). However, such transfer faces a critical
challenge: effective "slow thinking" in VRMs requires visual
reflection, the ability to check the reasoning process based on visual
information. Through quantitative analysis, we observe that current VRMs
exhibit limited visual reflection, as their attention to visual information
diminishes rapidly with longer generated responses. To address this challenge,
we propose a new VRM, Reflection-V, which enhances visual reflection
through reasoning-data construction for cold-start learning and reward design for
reinforcement learning (RL). First, we construct vision-centered reasoning
data by leveraging an agent that interacts between VLMs and reasoning LLMs,
enabling cold-start learning of visual reflection patterns. Second, a
visual-attention-based reward model is employed during RL to encourage reasoning
based on visual information. As a result, Reflection-V demonstrates
significant improvements across multiple visual reasoning benchmarks.
Furthermore, Reflection-V maintains a stronger and more consistent
reliance on visual information during visual reasoning, indicating effective
enhancement in visual reflection capabilities.
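
The abstract does not specify how attention to visual information is quantified, so the following is only an illustrative sketch of one simple proxy, not the authors' metric or reward. It assumes that, for each generated token, a vector of non-negative attention weights over all earlier positions is available (for example, averaged over heads and layers), and that the indices of the image tokens are known. The function names, the data layout, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def visual_attention_share(
    attn_rows: Sequence[Sequence[float]],
    visual_positions: Sequence[int],
) -> List[float]:
    """Fraction of each generated token's attention mass that falls on visual tokens.

    attn_rows[t] holds the attention weights of generated token t over all
    earlier positions (assumed layout). visual_positions lists the indices
    of the image tokens in the context.
    """
    visual = set(visual_positions)
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total <= 0:
            # No attention mass recorded for this step; treat the share as zero.
            shares.append(0.0)
            continue
        on_visual = sum(w for i, w in enumerate(row) if i in visual)
        shares.append(on_visual / total)
    return shares


def late_to_early_ratio(shares: Sequence[float]) -> float:
    """Mean visual share in the second half of the response divided by the first half.

    A value well below 1 means attention to the image fades as generation proceeds.
    """
    if len(shares) < 2:
        raise ValueError("need at least two generated tokens")
    mid = len(shares) // 2
    early = sum(shares[:mid]) / mid
    late = sum(shares[mid:]) / (len(shares) - mid)
    if early == 0:
        raise ValueError("early visual share is zero; ratio is undefined")
    return late / early


# Toy data (made up): positions 0 and 1 are image tokens, position 2 is prompt text,
# and each later row attends over one more previously generated token.
toy_rows = [
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.2, 0.2],
    [0.1, 0.1, 0.2, 0.2, 0.4],
    [0.05, 0.05, 0.2, 0.2, 0.2, 0.3],
]
shares = visual_attention_share(toy_rows, visual_positions=[0, 1])
ratio = late_to_early_ratio(shares)
```

In this toy case the per-token visual share falls from step to step, so the ratio comes out below 1. A quantity of this kind could in principle be tracked across response length, or fed into a reward signal, but how Reflection-V actually does either is not stated in the abstract.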