Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
September 30, 2025
Authors: Runze Liu, Jiakang Wang, Yuling Shi, Zhihui Xie, Chenxin An, Kaiyan Zhang, Jian Zhao, Xiaodong Gu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
cs.AI
Abstract
Reinforcement Learning (RL) has shown remarkable success in enhancing the
reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL
(PSRL) has emerged as a more effective paradigm compared to outcome-based RL.
However, existing PSRL approaches suffer from limited exploration efficiency,
in terms of both branching positions and sampling. In this paper, we introduce
a novel PSRL framework (AttnRL), which enables efficient exploration for
reasoning models. Motivated by preliminary observations that steps exhibiting
high attention scores correlate with reasoning behaviors, we propose to branch
from positions with high attention values. Furthermore, we develop an adaptive sampling
strategy that accounts for problem difficulty and historical batch size,
ensuring that the whole training batch maintains non-zero advantage values. To
further improve sampling efficiency, we design a one-step off-policy training
pipeline for PSRL. Extensive experiments on multiple challenging mathematical
reasoning benchmarks demonstrate that our method consistently outperforms prior
approaches in terms of performance as well as sampling and training efficiency.
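
The abstract describes the two exploration mechanisms only at a high level. As a rough, hypothetical sketch (not the authors' implementation), the Python fragment below illustrates the two ideas in isolation: choosing branch positions from the reasoning steps with the highest aggregated attention scores, and scaling the number of rollouts with problem difficulty so that a sampled group is less likely to be uniformly correct or incorrect, which would yield zero advantage under a group-relative baseline. The function names, the way attention scores are aggregated, and the linear rollout-count heuristic are all assumptions made for illustration.

    import numpy as np

    def select_branch_positions(step_attention_scores, k=2):
        """Pick the indices of the k reasoning steps with the highest
        aggregated attention scores as branching points for exploration.
        (Illustrative only; how scores are aggregated over heads and
        layers is an assumption here.)"""
        scores = np.asarray(step_attention_scores, dtype=float)
        k = min(k, len(scores))
        top = np.argsort(-scores)[:k]          # top-k by descending score
        return sorted(top.tolist())            # restore original step order

    def adaptive_num_rollouts(pass_rate, n_min=2, n_max=16):
        """Allocate more rollouts to harder problems (lower pass rate) so a
        sampled group is less likely to be all-correct or all-wrong, which
        would give zero advantage. The linear schedule is a toy choice."""
        difficulty = 1.0 - pass_rate           # 0 = easy, 1 = hard
        n = int(round(n_min + difficulty * (n_max - n_min)))
        return max(n_min, min(n_max, n))

    if __name__ == "__main__":
        # Toy example: six reasoning steps with aggregated attention scores.
        scores = [0.11, 0.42, 0.08, 0.35, 0.19, 0.27]
        print(select_branch_positions(scores, k=2))    # -> [1, 3]
        print(adaptive_num_rollouts(pass_rate=0.25))   # harder problem -> more rollouts

In the method itself, these decisions would be driven by the model's own attention maps and by historical batch statistics; the sketch reduces them to toy inputs purely to make the mechanism concrete.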