VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition. This overlooks a critical source of ambiguity and makes it hard to build policies that are robust to plausible visual variations. We introduce VOGUE (Visual Uncertainty Guided Exploration), a method that shifts exploration from the output (text) space to the input (visual) space. Treating the image as a stochastic context, VOGUE quantifies the policy's sensitivity to visual perturbations as the symmetric KL divergence between a "raw" branch and a "noisy" branch, which yields a direct signal for uncertainty-aware exploration. This signal shapes the learning objective through an uncertainty-proportional bonus. Combined with a token-entropy bonus and an annealed sampling schedule, the bonus balances exploration and exploitation. Implemented within GRPO at two model scales (Qwen2.5-VL-3B/7B), VOGUE improves pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks. It also improves pass@4 performance and mitigates the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
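The uncertainty signal described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`symmetric_kl`, `visual_uncertainty`, `uncertainty_bonus`), the coefficient `beta`, the choice to sum the two KL directions rather than average them, and the choice to average over token positions. The abstract does not specify how the bonus enters the GRPO objective, so the sketch stops at computing a scalar bonus from per-token logits of the two branches.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetric KL, taken here as KL(p||q) + KL(q||p) (assumed form)."""
    return kl(p, q) + kl(q, p)


def visual_uncertainty(raw_logits, noisy_logits):
    """Mean symmetric KL across token positions (averaging is an assumption).

    raw_logits / noisy_logits: one list of vocabulary logits per token,
    from the policy conditioned on the raw image and on a perturbed image.
    """
    scores = [
        symmetric_kl(softmax(r), softmax(n))
        for r, n in zip(raw_logits, noisy_logits)
    ]
    return sum(scores) / len(scores)


def uncertainty_bonus(raw_logits, noisy_logits, beta=0.1):
    """Bonus proportional to visual uncertainty; beta is a hypothetical coefficient."""
    return beta * visual_uncertainty(raw_logits, noisy_logits)


# Toy example: two token positions, vocabulary of size three.
raw = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
noisy = [[1.0, 1.5, -1.0], [0.1, 0.2, 0.3]]
bonus = uncertainty_bonus(raw, noisy)
```

In this toy input the first token's distribution shifts under the perturbation while the second does not, so the bonus is positive. Identical branches would give a bonus of zero, matching the intuition that a policy insensitive to visual noise carries no visual uncertainty.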
