Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
March 24, 2026
Authors: Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng
cs.AI
Abstract
Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT trajectories uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building on this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden-state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments on diverse multimodal benchmarks spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification demonstrate consistent and robust improvements over strong RL baselines while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
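The abstract describes the core mechanism only at a high level: a perception prior derived from hidden-state similarity, blended with token entropy via a smooth gate to reweight a sequence-level advantage per token. Below is a minimal PyTorch sketch of one plausible instantiation under stated assumptions; the function name pepo_token_advantages, the max-cosine-similarity prior, the sigmoid gate, the gate_temp parameter, and the normalization choices are all illustrative guesses, not the paper's exact formulation (see the repository above for the authors' implementation).

```python
import torch
import torch.nn.functional as F

def pepo_token_advantages(text_hidden, visual_hidden, logits, seq_advantage,
                          gate_temp=1.0):
    """Sketch of token-level advantage shaping in the spirit of PEPO.

    text_hidden:   (T, d) hidden states of the T generated tokens
    visual_hidden: (V, d) hidden states of the V visual tokens
    logits:        (T, vocab) policy logits at each generated position
    seq_advantage: scalar sequence-level advantage (e.g., from GRPO)
    """
    # Perception prior (assumed form): each token's max cosine similarity
    # to any visual token, as a proxy for visual grounding.
    text_norm = F.normalize(text_hidden, dim=-1)               # (T, d)
    vis_norm = F.normalize(visual_hidden, dim=-1)              # (V, d)
    perception = (text_norm @ vis_norm.T).max(dim=-1).values   # (T,)

    # Token entropy of the policy distribution (exploration signal),
    # normalized to [0, 1] for comparability with the perception prior.
    logp = F.log_softmax(logits, dim=-1)                       # (T, vocab)
    entropy = -(logp.exp() * logp).sum(dim=-1)                 # (T,)
    entropy = entropy / entropy.max().clamp_min(1e-6)

    # Smooth gate (assumed sigmoid form): strongly grounded tokens lean on
    # the perception signal, weakly grounded ones on the entropy signal.
    gate = torch.sigmoid((perception - perception.mean()) / gate_temp)
    weight = gate * perception + (1.0 - gate) * entropy        # (T,)

    # Modulate the shared sequence-level advantage per token.
    return seq_advantage * weight

# Toy usage: 6 generated tokens, 4 visual tokens, hidden dim 16, vocab 32.
T, V, d, vocab = 6, 4, 16, 32
adv = pepo_token_advantages(torch.randn(T, d), torch.randn(V, d),
                            torch.randn(T, vocab),
                            seq_advantage=torch.tensor(0.8))
print(adv.shape)  # torch.Size([6])
```

This shape of computation is consistent with the abstract's claim of drop-in compatibility: it consumes only quantities already available in a GRPO or DAPO training step (hidden states, logits, a group-relative advantage) and introduces no extra supervision or auxiliary branch.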