Perception-Aware Policy Optimization for Multimodal Reasoning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we add the Implicit Perception Loss, a KL divergence term, to the GRPO objective. Despite its simplicity, it yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
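The abstract describes the Implicit Perception Loss only as a KL divergence term and does not say which distributions are compared. As a minimal sketch of the quantity involved, and not the paper's implementation, the snippet below computes the KL divergence between two discrete distributions. The function name `kl_divergence` and the two example distributions (labeled as coming from an original and a corrupted visual input) are assumptions made purely for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions given as probability lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is clamped to avoid division by zero.
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical next-token distributions over a 3-token vocabulary.
# Assumption: one comes from the original image, the other from a corrupted one.
p_full = [0.7, 0.2, 0.1]
p_masked = [0.4, 0.3, 0.3]

kl_value = kl_divergence(p_full, p_masked)
print(f"KL(p_full || p_masked) = {kl_value:.4f}")
```

A larger value means the two distributions differ more. How such a term is weighted and combined with the GRPO objective is not specified in the abstract.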

