VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
October 1, 2025
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity, and struggle to build policies robust to plausible visual variations. We introduce VOGUE (Visual Uncertainty Guided Exploration), a novel method that shifts exploration from the output (text) space to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy's sensitivity to visual perturbations using the symmetric KL divergence between a "raw" and a "noisy" branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
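
To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the ingredients the abstract names: the symmetric-KL uncertainty signal between the raw and noisy branches, a bonus-shaped reward, and an annealed sampling temperature. The `policy` interface, the Gaussian pixel perturbation, and all coefficients (`noise_std`, `beta_u`, `beta_h`, the temperature endpoints) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def visual_uncertainty(policy, image, response_ids, noise_std=0.1):
    """Symmetric KL between the policy's token distributions under the
    raw image and a perturbed ("noisy") copy of the same image.

    Assumes `policy(image, response_ids)` returns logits of shape
    (seq_len, vocab_size) for the response conditioned on the image.
    """
    # Assumed perturbation: additive Gaussian pixel noise; the paper's
    # exact perturbation scheme may differ.
    noisy_image = image + noise_std * torch.randn_like(image)

    log_p = F.log_softmax(policy(image, response_ids), dim=-1)        # raw branch
    log_q = F.log_softmax(policy(noisy_image, response_ids), dim=-1)  # noisy branch

    # Symmetric KL = KL(p || q) + KL(q || p), averaged over response tokens.
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return kl_pq + kl_qp

def shaped_reward(verifiable_reward, uncertainty, token_entropy,
                  beta_u=0.05, beta_h=0.01):
    """Illustrative shaping: the verifiable reward plus an
    uncertainty-proportional bonus and a token-entropy bonus.
    Coefficients are placeholders."""
    return verifiable_reward + beta_u * uncertainty + beta_h * token_entropy

def annealed_temperature(step, total_steps, t_start=1.2, t_end=0.7):
    """Assumed linear anneal of the sampling temperature from exploratory
    (high T) toward exploitative (low T); the paper's schedule may differ."""
    frac = min(1.0, step / total_steps)
    return t_start + frac * (t_end - t_start)
```

In a GRPO-style loop, `visual_uncertainty` might be computed once per rollout and folded into the group-normalized advantage via `shaped_reward`, while `annealed_temperature` would set the decoding temperature at each training step.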