Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

Abstract

Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm than outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in where they branch and in how they sample. In this paper, we introduce AttnRL, a novel PSRL framework that enables efficient exploration for reasoning models. Motivated by the preliminary observation that steps with high attention scores correlate with reasoning behaviors, we propose to branch from positions with high attention scores. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in performance, sampling efficiency, and training efficiency.
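The two exploration ideas in the abstract can be illustrated with a small sketch. It is not the AttnRL implementation: the abstract does not say how per-step attention scores are computed or how advantages are defined, so the code assumes one precomputed score per reasoning step, a top-k selection rule, and a simple mean-baseline advantage. The names `pick_branch_steps`, `advantages`, and `keep_informative` are hypothetical.

```python
from statistics import mean


def pick_branch_steps(step_scores, k):
    """Return the indices of the k highest-scoring steps, in step order.

    Assumption: step_scores holds one attention score per reasoning step.
    """
    ranked = sorted(range(len(step_scores)),
                    key=lambda i: step_scores[i], reverse=True)
    return sorted(ranked[:k])


def advantages(rewards):
    """Assumed mean-baseline advantage: each reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def keep_informative(groups):
    """Drop prompts whose sampled rewards are all identical.

    Under the mean-baseline assumption, such prompts yield only zero
    advantages and therefore contribute no learning signal.
    """
    return {prompt: rs for prompt, rs in groups.items() if len(set(rs)) > 1}


if __name__ == "__main__":
    print(pick_branch_steps([0.1, 0.9, 0.3, 0.7], k=2))
    print(advantages([1, 0, 1, 0]))
    print(keep_informative({"easy": [1, 1, 1, 1], "mixed": [1, 0, 0, 1]}))
```

In this sketch, a prompt that every rollout solves (or every rollout fails) is filtered out, which is one simple way to keep a batch free of all-zero advantages; the adaptive strategy described in the abstract additionally accounts for problem difficulty and historical batch size, which the sketch does not model.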
