

Perception-Aware Policy Optimization for Multimodal Reasoning

July 8, 2025
作者: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
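
As a rough sketch of where such a KL term could sit in the objective (the symbols below, including the weighting coefficient \gamma and the choice of a masked image as the degraded input, are illustrative assumptions rather than the paper's exact formulation), the perception-aware objective might take a form like:

$$
\mathcal{J}_{\text{PAPO}}(\theta) \;=\; \mathcal{J}_{\text{GRPO}}(\theta) \;+\; \gamma \,\mathbb{E}_{(q,\, I)}\Big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid q, I) \,\big\|\, \pi_\theta(\cdot \mid q, \tilde{I}) \big) \Big]
$$

Here $q$ denotes the textual query, $I$ the original visual input, $\tilde{I}$ a degraded (e.g., masked) version of it, and $\gamma$ a weighting coefficient; encouraging a large divergence rewards outputs that genuinely depend on the image. The Double Entropy Loss mentioned above would then serve as an additional regularizer against the loss hacking behavior the authors describe.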