Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective

Abstract

Reinforcement learning shows promise for improving the reasoning abilities of large language models, but it is hard to scale because of low sample efficiency during the rollout phase. Existing methods try to improve efficiency by scheduling problems according to their difficulty. However, these approaches rely on unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty during RL training, which leads to suboptimal results. To address these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which estimates problem difficulty accurately and stably by aggregating the historical performance discrepancies of each problem. The model's competence is then quantified, and a fixed-point system is used to adaptively select problems whose difficulty matches the model's current competence. Experiments on a range of challenging mathematical benchmarks show that CDAS achieves substantial improvements in both accuracy and efficiency. CDAS attains the highest average accuracy among the baselines compared and is considerably faster than Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.
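The abstract does not give formulas, so the following is only a toy illustration of the general idea, not the paper's implementation. It assumes that difficulty is a running average of the gap between an expected and an observed pass rate, that the expected pass rate is a sigmoid of competence minus difficulty, and that competence is the negative mean difficulty. All names (`AlignmentSampler`, `update`, `select`) are hypothetical, and the sketch does not attempt to reproduce the fixed-point system mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class AlignmentSampler:
    """Toy sampler (hypothetical): picks problems whose estimated
    difficulty is closest to the model's estimated competence."""

    def __init__(self, num_problems):
        # Running sum of (expected - observed) pass-rate gaps per problem.
        self.gap_sum = [0.0] * num_problems
        # Number of times each problem has been rolled out.
        self.visits = [0] * num_problems

    def difficulty(self, i):
        # Assumption: difficulty is the average historical gap;
        # unseen problems default to 0.
        return self.gap_sum[i] / self.visits[i] if self.visits[i] else 0.0

    def competence(self):
        # Assumption: competence is the negative mean difficulty.
        n = len(self.gap_sum)
        return -sum(self.difficulty(i) for i in range(n)) / n

    def update(self, i, observed_pass_rate):
        # Assumption: the expected pass rate follows a sigmoid of
        # (competence - difficulty), computed before the update.
        expected = sigmoid(self.competence() - self.difficulty(i))
        self.gap_sum[i] += expected - observed_pass_rate
        self.visits[i] += 1

    def select(self, batch_size):
        # Choose the problems whose difficulty is nearest to competence.
        c = self.competence()
        order = sorted(range(len(self.gap_sum)),
                       key=lambda i: abs(c - self.difficulty(i)))
        return order[:batch_size]


sampler = AlignmentSampler(num_problems=3)
sampler.update(0, observed_pass_rate=0.0)  # model failed every rollout
sampler.update(1, observed_pass_rate=1.0)  # model solved every rollout
print(sampler.select(batch_size=1))
```

In this toy run, problem 0 ends up with a positive difficulty and problem 1 with a negative one, so the untouched problem 2 sits closest to the estimated competence and is selected next.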