Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
May 23, 2025
Authors: Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
cs.AI
Abstract
Reinforcement learning exhibits potential in enhancing the reasoning
abilities of large language models, yet it is hard to scale due to low sample
efficiency during the rollout phase. Existing methods attempt to improve
efficiency by scheduling problems based on problem difficulties. However, these
approaches suffer from unstable and biased estimations of problem difficulty
and fail to capture the alignment between model competence and problem
difficulty in RL training, leading to suboptimal results. To tackle these
limitations, this paper introduces Competence-Difficulty
Alignment Sampling (CDAS), which enables accurate
and stable estimation of problem difficulties by aggregating historical
performance discrepancies of problems. Then the model competence is quantified
to adaptively select problems whose difficulty is in alignment with the model's
current competence using a fixed-point system. Experimental results across a
range of challenging mathematical benchmarks show that CDAS achieves substantial
improvements in both accuracy and efficiency. CDAS attains the highest average
accuracy among the baselines and exhibits a significant speed advantage over
Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower
than CDAS.
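The abstract describes the mechanism only at a high level: per-problem difficulty is estimated by aggregating historical performance discrepancies, model competence is quantified, and a fixed-point system selects problems whose difficulty aligns with that competence. Below is a minimal illustrative sketch of what such a sampler could look like; the class name `CDASampler`, the EMA smoothing, the specific discrepancy and competence-update formulas, and the nearest-gap selection rule are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

# Illustrative sketch only: the concrete CDAS formulas are not given in the
# abstract, so the update rules below are assumptions for illustration.
class CDASampler:
    def __init__(self, num_problems, ema=0.9):
        self.difficulty = np.full(num_problems, 0.5)  # running difficulty estimate per problem
        self.competence = 0.5                          # scalar estimate of current model competence
        self.ema = ema

    def update(self, problem_ids, pass_rates):
        """Aggregate historical performance discrepancies into difficulty estimates."""
        pass_rates = np.asarray(pass_rates, dtype=float)
        # Discrepancy between expected solvability (competence) and observed pass rate:
        # problems the model fails more often than expected are treated as harder.
        discrepancy = self.competence - pass_rates
        self.difficulty[problem_ids] = (
            self.difficulty[problem_ids] + (1 - self.ema) * discrepancy
        )
        self.difficulty = np.clip(self.difficulty, 0.0, 1.0)
        # Fixed-point style update: competence drifts toward the difficulty level
        # the model can currently solve, weighted by observed pass rates.
        solved_level = np.sum(self.difficulty[problem_ids] * pass_rates) / max(pass_rates.sum(), 1e-8)
        self.competence = self.ema * self.competence + (1 - self.ema) * solved_level

    def sample(self, batch_size):
        """Select problems whose estimated difficulty is closest to current competence."""
        gap = np.abs(self.difficulty - self.competence)
        return np.argsort(gap)[:batch_size]


# Example usage (hypothetical numbers):
sampler = CDASampler(num_problems=1000)
batch = sampler.sample(batch_size=8)
# ... run rollouts on `batch`, measure per-problem pass rates, then:
sampler.update(batch, pass_rates=[0.25, 0.5, 0.0, 1.0, 0.75, 0.5, 0.25, 0.0])
```

In this sketch, difficulty and competence are updated jointly: each step nudges competence toward the difficulty level the model currently solves, and the next batch is drawn from problems closest to that level, which is the fixed-point flavor the abstract describes.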