Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
May 27, 2026
Authors: Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee
cs.AI
Abstract
Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools because internal reasoning alone often cannot resolve them. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes such as GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted in only ~30% of rollouts, and, when it is attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, which suppresses the learning signal at exactly the tool calls that need it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, and it pairs this resampling with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model trained with SFT+AXPO surpasses the 32B Base model on Pass@4 with one quarter of the parameters.
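The second symptom can be illustrated with a minimal sketch. It assumes binary correctness rewards and the commonly used group-standardized advantage (reward minus group mean, divided by group standard deviation). The function names, the example group, and the `eps` constant are hypothetical and chosen for illustration only; this is not the authors' implementation, and it does not cover AXPO's resampling step or its uncertainty-based prefix selection.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tool_subgroup_all_wrong(rewards, used_tool):
    """True when at least one rollout used a tool and every tool-using rollout failed."""
    tool_rewards = [r for r, t in zip(rewards, used_tool) if t]
    return bool(tool_rewards) and all(r == 0 for r in tool_rewards)


# Hypothetical group of 8 rollouts for one question (1 = correct, 0 = wrong).
# Only two rollouts attempt a tool call, and both of them fail.
rewards = [1, 0, 1, 0, 0, 1, 0, 0]
used_tool = [False, True, False, True, False, False, False, False]

advantages = group_relative_advantages(rewards)
tool_advantages = [a for a, t in zip(advantages, used_tool) if t]

# Every tool-using rollout receives a negative advantage, so no tool call
# in this group is ever reinforced.
print(tool_advantages)
print(tool_subgroup_all_wrong(rewards, used_tool))
```

In this toy group the policy still receives positive signal from the thinking-only rollouts, but the tool calls themselves are only ever penalized. A check like `tool_subgroup_all_wrong` marks the kind of subgroup that the abstract describes AXPO as targeting for resampling from a fixed thinking prefix.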