Agent eXplorative Policy Optimization for Multimodal Agentic Reasoning

Abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems cannot be resolved by internal reasoning alone and require external tools. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes such as GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when it is attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, which suppresses the learning signal at the tool calls that need it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, and pairs this resampling with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with one quarter of the parameters.
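The second symptom and the resampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary rewards (1.0 correct, 0.0 wrong) and a GRPO-style advantage normalized by the group mean and standard deviation. The callables `resample` and `uncertainty`, the rollout dictionary keys, and the rule of choosing the prefix with the highest uncertainty score are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward normalized within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_all_wrong_tool_subgroup(rewards, used_tool):
    """True if some rollouts used a tool and every one of them failed."""
    tool_rewards = [r for r, t in zip(rewards, used_tool) if t]
    return bool(tool_rewards) and all(r == 0.0 for r in tool_rewards)


def explore_from_prefix(rollouts, resample, uncertainty, n_resamples=4):
    """Resample tool calls from one fixed thinking prefix (illustrative only).

    rollouts:    list of dicts with keys 'prefix', 'used_tool', 'reward'
                 (hypothetical layout).
    resample:    hypothetical callable; given a fixed thinking prefix, it
                 samples a new tool call plus continuation and returns a rollout.
    uncertainty: hypothetical callable scoring a prefix; picking the highest
                 score is an assumption, not a detail from the abstract.
    """
    rewards = [r["reward"] for r in rollouts]
    used_tool = [r["used_tool"] for r in rollouts]
    if not is_all_wrong_tool_subgroup(rewards, used_tool):
        return []  # the tool-using subgroup already carries a learning signal
    tool_rollouts = [r for r in rollouts if r["used_tool"]]
    chosen = max(tool_rollouts, key=lambda r: uncertainty(r["prefix"]))
    return [resample(chosen["prefix"]) for _ in range(n_resamples)]


# A group of four rollouts: two thinking-only successes, two failed tool calls.
rewards = [1.0, 1.0, 0.0, 0.0]
used_tool = [False, False, True, True]
advantages = group_relative_advantages(rewards)
# Both tool-using rollouts receive the same advantage, so nothing
# distinguishes one tool call from the other.
print(advantages, is_all_wrong_tool_subgroup(rewards, used_tool))
```

In this toy group the two tool-using rollouts get identical advantages, which is one way to read the "suppressed learning signal" the abstract describes. Resampling from a fixed prefix gives the tool-using subgroup a chance to contain both correct and incorrect continuations.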