Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

May 27, 2026
Authors: Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee
cs.AI

Abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools and often cannot be resolved by internal reasoning alone. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes such as GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only about 30% of rollouts, and when it is attempted, the tool-using rollouts within a group are all wrong on about 40% of questions, which suppresses the learning signal at exactly the tool calls that need it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base model on Pass@4 with 4x fewer parameters.
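
The second symptom can be illustrated with a small numeric sketch. It assumes the commonly used GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) with binary correctness rewards; the reward values, the `used_tool` flags, and the helper names are hypothetical and are not taken from the paper. The sketch only shows how an all-wrong tool-using subgroup can be detected and that none of its rollouts receives a positive advantage. It does not implement AXPO's prefix-fixed resampling or its uncertainty-based prefix selection.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Uses the population standard deviation; the exact normalization in a
    given GRPO implementation may differ (assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tool_subgroup_all_wrong(rewards, used_tool):
    """True if at least one rollout used a tool and every such rollout is wrong."""
    tool_rewards = [r for r, t in zip(rewards, used_tool) if t]
    return bool(tool_rewards) and all(r == 0 for r in tool_rewards)


# Hypothetical group of 8 rollouts for one question (1 = correct, 0 = wrong).
rewards = [1, 0, 1, 0, 0, 1, 0, 0]
used_tool = [False, True, False, False, True, False, False, False]

advantages = group_relative_advantages(rewards)
tool_advantages = [a for a, t in zip(advantages, used_tool) if t]

if tool_subgroup_all_wrong(rewards, used_tool):
    # No tool-using rollout is rewarded, so the tool-call tokens get no
    # positive learning signal from this group.
    print("all-wrong tool subgroup; tool advantages:", tool_advantages)
```

In this toy group the thinking-only rollouts carry all of the positive advantage, which is the situation the abstract describes as motivating resampling from the tool call onward.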