VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
October 1, 2025
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in
large language models (LLMs) but struggles with exploration, an issue that
still persists for multimodal LLMs (MLLMs). Current methods treat the visual
input as a fixed, deterministic condition, overlooking a critical source of
ambiguity and struggling to build policies robust to plausible visual
variations. We introduce VOGUE (Visual Uncertainty Guided
Exploration), a novel method that shifts exploration from the output (text)
to the input (visual) space. By treating the image as a stochastic context,
VOGUE quantifies the policy's sensitivity to visual perturbations using the
symmetric KL divergence between a "raw" and "noisy" branch, creating a direct
signal for uncertainty-aware exploration. This signal shapes the learning
objective via an uncertainty-proportional bonus, which, combined with a
token-entropy bonus and an annealed sampling schedule, effectively balances
exploration and exploitation. Implemented within GRPO on two model scales
(Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three
visual math benchmarks and 3.7% on three general-domain reasoning benchmarks,
while simultaneously increasing pass@4 performance and mitigating the
exploration decay commonly observed in RL fine-tuning. Our work shows that
grounding exploration in the inherent uncertainty of visual inputs is an
effective strategy for improving multimodal reasoning.
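
The core mechanism described above (a symmetric-KL uncertainty signal computed between a "raw" and a "noisy" image branch, folded into the reward together with a token-entropy bonus and an annealed sampling schedule) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `symmetric_kl`, `shaped_reward`, and `annealed_noise_prob`, the coefficients `lambda_u` and `lambda_h`, and the linear annealing shape are all assumptions made for the example.

```python
# Minimal sketch (not the paper's code): a visual-uncertainty exploration
# signal for an MLLM policy, assuming the policy exposes per-token
# log-probabilities for a sampled response under a given image.
import torch


def symmetric_kl(logp_raw: torch.Tensor, logp_noisy: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between per-token distributions from the "raw" and
    "noisy" image branches.

    logp_raw, logp_noisy: [seq_len, vocab_size] log-probabilities for the
    same sampled response. Returns a scalar averaged over tokens.
    """
    p, q = logp_raw.exp(), logp_noisy.exp()
    kl_pq = (p * (logp_raw - logp_noisy)).sum(-1)  # KL(raw || noisy) per token
    kl_qp = (q * (logp_noisy - logp_raw)).sum(-1)  # KL(noisy || raw) per token
    return (0.5 * (kl_pq + kl_qp)).mean()


def shaped_reward(task_reward: float,
                  logp_raw: torch.Tensor,
                  logp_noisy: torch.Tensor,
                  lambda_u: float = 0.1,
                  lambda_h: float = 0.01) -> torch.Tensor:
    """Adds an uncertainty-proportional bonus and a token-entropy bonus
    (hypothetical coefficients lambda_u, lambda_h) to a verifiable task
    reward, mirroring the exploration shaping the abstract describes."""
    uncertainty = symmetric_kl(logp_raw, logp_noisy)
    entropy = -(logp_raw.exp() * logp_raw).sum(-1).mean()
    return task_reward + lambda_u * uncertainty + lambda_h * entropy


def annealed_noise_prob(step: int, total_steps: int,
                        p_start: float = 0.5, p_end: float = 0.05) -> float:
    """Linearly anneals the probability of sampling rollouts from the noisy
    branch, shifting from exploration toward exploitation over training.
    (The schedule shape and endpoints are assumptions for illustration.)"""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + frac * (p_end - p_start)


# Tiny usage example with random log-probabilities (5 tokens, vocab of 8).
logp_a = torch.randn(5, 8).log_softmax(-1)
logp_b = torch.randn(5, 8).log_softmax(-1)
print(shaped_reward(1.0, logp_a, logp_b), annealed_noise_prob(100, 1000))
```

In an actual GRPO training loop, a bonus of this kind would be combined with the group-normalized advantages of the verifiable reward; the sketch only shows how the uncertainty and entropy terms could enter the scalar reward.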