

Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

May 23, 2025
Authors: Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
cs.AI

Abstract

Reinforcement learning exhibits potential for enhancing the reasoning abilities of large language models, yet it is hard to scale due to low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulty. However, these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulty by aggregating the historical performance discrepancies of problems. Model competence is then quantified, and a fixed-point system is used to adaptively select problems whose difficulty aligns with the model's current competence. Experimental results across a range of challenging mathematical benchmarks show that CDAS achieves substantial improvements in both accuracy and efficiency: it attains the highest average accuracy among the baselines and exhibits a significant speed advantage over Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.
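
The abstract describes CDAS only at a high level. As a rough illustration of the idea (not the paper's actual algorithm), the minimal Python sketch below estimates per-problem difficulty as an exponential moving average of historical failure rates, quantifies model competence as the highest difficulty level the model still solves about half the time, and samples the problems whose estimated difficulty is closest to that competence. All function names, formulas, decay factors, and thresholds are illustrative assumptions, not the paper's fixed-point formulation.

```python
# Minimal sketch of competence-difficulty alignment sampling.
# Assumptions (not from the paper): EMA aggregation of failure rates,
# a 0.5 pass-rate threshold for competence, and nearest-difficulty selection.
import numpy as np

def update_difficulty(difficulty, pass_rate, decay=0.9):
    """Aggregate historical performance discrepancies into a stable
    per-problem difficulty estimate (EMA of 1 - pass rate; an assumption)."""
    return decay * difficulty + (1.0 - decay) * (1.0 - pass_rate)

def estimate_competence(difficulties, pass_rates):
    """Quantify model competence as the hardest difficulty the model still
    solves at least half the time (a stand-in for the fixed-point system)."""
    solved = pass_rates >= 0.5
    return difficulties[solved].max() if solved.any() else difficulties.min()

def select_batch(difficulties, competence, batch_size=8):
    """Select problems whose estimated difficulty is closest to the
    model's current competence level."""
    gap = np.abs(difficulties - competence)
    return np.argsort(gap)[:batch_size]

# Toy usage: 100 problems with running difficulty estimates and
# rollout pass rates observed during RL training.
rng = np.random.default_rng(0)
difficulties = rng.uniform(0.0, 1.0, size=100)
pass_rates = rng.uniform(0.0, 1.0, size=100)
competence = estimate_competence(difficulties, pass_rates)
batch = select_batch(difficulties, competence)
difficulties[batch] = update_difficulty(difficulties[batch], pass_rates[batch])
```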
