Perception-Aware Policy Optimization for Multimodal Reasoning

July 8, 2025
作者: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
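
For intuition only, the sketch below shows one way the objective described in the abstract could be written. It is not the paper's exact formulation: the coefficient γ, the masked visual input I_mask, and the direction in which the divergence is encouraged are our assumptions based solely on the description of the Implicit Perception Loss as a KL term added to the GRPO objective.

\mathcal{J}_{\text{PAPO}}(\theta) = \mathcal{J}_{\text{GRPO}}(\theta) + \gamma \, D_{\mathrm{KL}}\!\big( \pi_\theta(o \mid q, I) \,\|\, \pi_\theta(o \mid q, I_{\text{mask}}) \big)

Here q is the textual query, I the original visual input, and I_mask a corrupted or masked copy of it. Encouraging the two conditional distributions to diverge rewards responses that genuinely depend on the image, consistent with the abstract's claim that the supervision signal is purely internal and requires no extra data curation or external reward models.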