Agent eXplorative Policy Optimization for Multimodal Agentic Reasoning

Abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools because internal reasoning alone often cannot resolve them. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes such as GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted in only ~30% of rollouts, and when it is attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, which suppresses the learning signal at exactly the tool calls that need it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model trained with SFT+AXPO surpasses the 32B Base model on Pass@4 with 4x fewer parameters.
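
To make the second symptom concrete, the following is a minimal sketch of how a group-relative advantage treats a group whose tool-using rollouts are all wrong. It is an illustration under stated assumptions, not the authors' implementation: the function names, the binary 0/1 reward, the population standard deviation, and the example group are all hypothetical, and the abstract does not specify how AXPO itself computes advantages or selects prefixes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group.

    Assumption: population std and a small eps for numerical stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def all_wrong_tool_subgroup(rewards, used_tool):
    """True if some rollout used a tool and every tool-using rollout was wrong."""
    tool_rewards = [r for r, t in zip(rewards, used_tool) if t]
    return bool(tool_rewards) and all(r == 0 for r in tool_rewards)


# Hypothetical group of 8 rollouts for one question (1 = correct, 0 = wrong).
rewards = [1, 0, 1, 1, 0, 0, 1, 0]
used_tool = [False, True, False, False, True, False, False, True]

advantages = group_relative_advantages(rewards)

# Advantages of the tool-using rollouts only.
tool_advantages = [a for a, t in zip(advantages, used_tool) if t]

if all_wrong_tool_subgroup(rewards, used_tool):
    # Every tool-using rollout receives the same negative advantage, so the
    # update cannot tell a promising tool call from a poor one.
    print("all-wrong tool subgroup; distinct tool advantages:",
          len(set(tool_advantages)))
```

In this hypothetical group, the three tool-using rollouts share one identical negative advantage, so the gradient carries no contrast between different tool calls. This is the kind of subgroup that, according to the abstract, AXPO targets by fixing the thinking prefix and resampling the tool call and its continuation.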