Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
September 30, 2025
Authors: Runze Liu, Jiakang Wang, Yuling Shi, Zhihui Xie, Chenxin An, Kaiyan Zhang, Jian Zhao, Xiaodong Gu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
cs.AI
Abstract
Reinforcement Learning (RL) has shown remarkable success in enhancing the
reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL
(PSRL) has emerged as a more effective paradigm compared to outcome-based RL.
However, existing PSRL approaches suffer from limited exploration efficiency,
both in the selection of branching positions and in the sampling strategy. In this
paper, we introduce
a novel PSRL framework (AttnRL), which enables efficient exploration for
reasoning models. Motivated by preliminary observations that steps exhibiting
high attention scores correlate with reasoning behaviors, we propose to branch
from positions with high attention values. Furthermore, we develop an adaptive sampling
strategy that accounts for problem difficulty and historical batch size,
ensuring that the whole training batch maintains non-zero advantage values. To
further improve sampling efficiency, we design a one-step off-policy training
pipeline for PSRL. Extensive experiments on multiple challenging mathematical
reasoning benchmarks demonstrate that our method consistently outperforms prior
approaches in performance, sampling efficiency, and training efficiency.
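To make the branching idea concrete, here is a minimal illustrative sketch (not the paper's implementation) of selecting branching positions from step-level attention scores: the k reasoning steps with the highest aggregated attention are treated as the prefixes from which additional rollouts are branched. The function and variable names are hypothetical.

import numpy as np

def select_branch_positions(step_attention_scores: np.ndarray, k: int = 2) -> list[int]:
    # Pick the k steps with the highest aggregated attention scores and return
    # their indices in trajectory order; these step boundaries serve as the
    # branching positions for additional rollouts.
    k = min(k, len(step_attention_scores))
    top_k = np.argsort(step_attention_scores)[-k:]  # indices of the k largest scores
    return sorted(top_k.tolist())

# Example: a 6-step trajectory with per-step aggregated attention scores.
scores = np.array([0.12, 0.45, 0.08, 0.51, 0.22, 0.30])
print(select_branch_positions(scores, k=2))  # -> [1, 3]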
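Likewise, a rough sketch of the adaptive sampling idea, under the assumption (suggested by the abstract, not confirmed in detail) that prompts whose rollouts all receive the same reward yield zero advantage and are dropped, and that the number of prompts sampled for the next batch is scaled by the previous batch's keep rate. All names and the exact scaling rule are illustrative.

def filter_zero_advantage(prompt_rewards: dict[str, list[float]]) -> list[str]:
    # Keep only prompts whose rollout rewards are not all identical, so that
    # every kept prompt contributes non-zero advantage values to the batch.
    return [p for p, rewards in prompt_rewards.items() if len(set(rewards)) > 1]

def adaptive_prompt_count(target_batch: int, prev_sampled: int, prev_kept: int) -> int:
    # Scale how many prompts to sample so that, after filtering, roughly
    # `target_batch` prompts survive, based on the previous batch's keep rate.
    keep_rate = max(prev_kept / max(prev_sampled, 1), 0.01)
    return round(target_batch / keep_rate)

# Example: the last batch kept 96 of 128 sampled prompts; target batch size 64.
print(adaptive_prompt_count(64, prev_sampled=128, prev_kept=96))  # -> 85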