VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
October 1, 2025
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in
large language models (LLMs) but struggles with exploration, an issue that
still persists for multimodal LLMs (MLLMs). Current methods treat the visual
input as a fixed, deterministic condition, overlooking a critical source of
ambiguity and struggling to build policies robust to plausible visual
variations. We introduce VOGUE (Visual Uncertainty Guided
Exploration), a novel method that shifts exploration from the output (text)
to the input (visual) space. By treating the image as a stochastic context,
VOGUE quantifies the policy's sensitivity to visual perturbations using the
symmetric KL divergence between a "raw" and "noisy" branch, creating a direct
signal for uncertainty-aware exploration. This signal shapes the learning
objective via an uncertainty-proportional bonus, which, combined with a
token-entropy bonus and an annealed sampling schedule, effectively balances
exploration and exploitation. Implemented within GRPO on two model scales
(Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three
visual math benchmarks and 3.7% on three general-domain reasoning benchmarks,
while simultaneously increasing pass@4 performance and mitigating the
exploration decay commonly observed in RL fine-tuning. Our work shows that
grounding exploration in the inherent uncertainty of visual inputs is an
effective strategy for improving multimodal reasoning.
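
The core mechanism described above (a symmetric-KL uncertainty signal computed between a "raw" and a "noisy" image branch, folded into the reward together with a token-entropy bonus and an annealed sampling schedule) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `symmetric_kl`, `shaped_reward`, and `annealed_noise_prob`, the coefficients `lambda_u` and `lambda_h`, and the linear annealing shape are all assumptions made for the example.

```python
# Minimal sketch (not the paper's code): a visual-uncertainty exploration
# signal for an MLLM policy, assuming the policy exposes per-token
# log-probabilities for a sampled response under a given image.
import torch


def symmetric_kl(logp_raw: torch.Tensor, logp_noisy: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between per-token distributions from the "raw" and
    "noisy" image branches.

    logp_raw, logp_noisy: [seq_len, vocab_size] log-probabilities for the
    same sampled response. Returns a scalar averaged over tokens.
    """
    p, q = logp_raw.exp(), logp_noisy.exp()
    kl_pq = (p * (logp_raw - logp_noisy)).sum(-1)  # KL(raw || noisy) per token
    kl_qp = (q * (logp_noisy - logp_raw)).sum(-1)  # KL(noisy || raw) per token
    return (0.5 * (kl_pq + kl_qp)).mean()


def shaped_reward(task_reward: float,
                  logp_raw: torch.Tensor,
                  logp_noisy: torch.Tensor,
                  lambda_u: float = 0.1,
                  lambda_h: float = 0.01) -> torch.Tensor:
    """Adds an uncertainty-proportional bonus and a token-entropy bonus
    (hypothetical coefficients lambda_u, lambda_h) to a verifiable task
    reward, mirroring the exploration shaping the abstract describes."""
    uncertainty = symmetric_kl(logp_raw, logp_noisy)
    entropy = -(logp_raw.exp() * logp_raw).sum(-1).mean()
    return task_reward + lambda_u * uncertainty + lambda_h * entropy


def annealed_noise_prob(step: int, total_steps: int,
                        p_start: float = 0.5, p_end: float = 0.05) -> float:
    """Linearly anneals the probability of sampling rollouts from the noisy
    branch, shifting from exploration toward exploitation over training.
    (The schedule shape and endpoints are assumptions for illustration.)"""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + frac * (p_end - p_start)


# Tiny usage example with random log-probabilities (5 tokens, vocab of 8).
logp_a = torch.randn(5, 8).log_softmax(-1)
logp_b = torch.randn(5, 8).log_softmax(-1)
print(shaped_reward(1.0, logp_a, logp_b), annealed_noise_prob(100, 1000))
```

In an actual GRPO training loop, a bonus of this kind would be combined with the group-normalized advantages of the verifiable reward; the sketch only shows how the uncertainty and entropy terms could enter the scalar reward.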