Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
May 23, 2025
Authors: Deyang Kong, Qi Guo, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
cs.AI
Abstract
Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale due to low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on their difficulty, but these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which enables accurate and stable estimation of problem difficulty by aggregating historical performance discrepancies of problems. Model competence is then quantified, and a fixed-point system is used to adaptively select problems whose difficulty aligns with the model's current competence. Experimental results across a range of challenging mathematical benchmarks show that CDAS achieves significant improvements in both accuracy and efficiency: it attains the highest average accuracy among the baselines and exhibits a notable speed advantage over Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.
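For intuition, the following is a minimal sketch of the sampling loop the abstract describes, assuming difficulty is tracked as a running mean of per-problem failure rates over rollout history and competence as a scalar relaxed toward the difficulty of recently attempted problems. The class name, update rule, and step size are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

class CDASampler:
    """Hypothetical sketch of competence-difficulty alignment sampling.

    Difficulty of each problem is estimated by aggregating its historical
    performance discrepancies (here: a running mean of per-rollout failure
    rates). Competence is a scalar tracked, fixed-point style, toward the
    difficulty of problems the model currently attempts.
    """

    def __init__(self, num_problems: int, prior: float = 0.5):
        # Start every problem's difficulty estimate at a neutral prior.
        self.difficulty = np.full(num_problems, prior)
        self.counts = np.zeros(num_problems)
        self.competence = prior  # same scale as difficulty

    def update(self, idx: np.ndarray, pass_rates: np.ndarray) -> None:
        """Aggregate one performance observation per rolled-out problem.

        `idx` is assumed to contain distinct problem indices; `pass_rates`
        holds the fraction of successful rollouts for each of them.
        """
        self.counts[idx] += 1
        # Incremental running mean: treat the failure rate (1 - pass rate)
        # as this step's difficulty observation for each problem.
        delta = (1.0 - pass_rates) - self.difficulty[idx]
        self.difficulty[idx] += delta / self.counts[idx]
        # Fixed-point-style relaxation: move competence toward the mean
        # difficulty of the problems just attempted (0.1 is an assumed rate).
        self.competence += 0.1 * (self.difficulty[idx].mean() - self.competence)

    def sample(self, batch_size: int) -> np.ndarray:
        """Select the problems whose difficulty best matches competence."""
        gap = np.abs(self.difficulty - self.competence)
        return np.argsort(gap)[:batch_size]
```

In each RL step, one would call `sample` to pick the rollout batch, run the rollouts, and feed the observed pass rates back through `update`, so that both the difficulty estimates and the competence estimate track the model as it improves.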