

Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

March 24, 2026
作者: Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng
cs.AI

Abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, demonstrate consistent and robust improvements over strong RL baselines while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
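The abstract describes modulating a sequence-level advantage per token by combining a perception prior (from hidden-state similarity) with token entropy via a smooth gate. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch under assumptions: the perception prior is taken as cosine similarity between each token's hidden state and a pooled visual representation, the gate is a sigmoid, and the per-token weight scales a GRPO-style scalar advantage. All function and parameter names are hypothetical.

```python
import numpy as np

def token_level_advantages(hidden_states, visual_anchor, token_logits,
                           sequence_advantage, gate_temp=1.0):
    """Hypothetical PEPO-style token-level advantage sketch.

    hidden_states:      (T, d) per-token hidden states
    visual_anchor:      (d,)   pooled visual representation (assumed proxy)
    token_logits:       (T, V) next-token logits at each position
    sequence_advantage: scalar group-relative advantage (as in GRPO)
    """
    # Perception prior: cosine similarity between each token's hidden
    # state and the pooled visual representation, in [-1, 1].
    h = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    v = visual_anchor / np.linalg.norm(visual_anchor)
    perception = h @ v                      # (T,)

    # Token entropy: uncertainty of the next-token distribution, used
    # here as a proxy for exploratory inference; normalized by log(V).
    probs = np.exp(token_logits - token_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=1)
    entropy_norm = entropy / np.log(probs.shape[1])   # (T,), roughly [0, 1]

    # Smooth gate (sigmoid is an assumption): strongly vision-grounded
    # tokens lean on the perception prior, weakly grounded ones on entropy.
    gate = 1.0 / (1.0 + np.exp(-perception / gate_temp))
    weight = gate * perception + (1.0 - gate) * entropy_norm

    # Modulate the shared sequence-level advantage per token.
    return sequence_advantage * weight      # (T,)
```

In an RLVR loop this per-token advantage would replace the uniform broadcast of the group-relative advantage in the GRPO/DAPO surrogate objective, which is what lets the update distinguish perception-anchored tokens from exploratory ones without extra supervision.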