Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

Abstract

Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
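
The sampling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes binary rewards, a hypothetical stopping rule (a prompt leaves the active set once it has at least one correct and one incorrect response), and a baseline equal to the mean reward over all responses drawn for the prompt. The names `adaptive_sample`, `build_group`, `sample_reward`, `per_round`, `max_rounds`, `need_pos`, and `need_neg` are all made up for this example.

```python
from typing import Callable, Dict, List, Tuple


def adaptive_sample(
    prompts: List[str],
    sample_reward: Callable[[str], float],
    per_round: int = 4,
    max_rounds: int = 4,
    need_pos: int = 1,
    need_neg: int = 1,
) -> Dict[str, List[float]]:
    """Draw responses in rounds, interleaving estimation and sampling.

    A prompt is eliminated from the active set as soon as it has collected
    enough positive and negative rewards (assumed stopping rule).
    """
    rewards: Dict[str, List[float]] = {p: [] for p in prompts}
    active = list(prompts)
    for _ in range(max_rounds):
        if not active:
            break
        still_active = []
        for p in active:
            # Spend this round's budget only on prompts that are still active.
            rewards[p].extend(sample_reward(p) for _ in range(per_round))
            pos = sum(r > 0 for r in rewards[p])
            neg = len(rewards[p]) - pos
            if pos < need_pos or neg < need_neg:
                still_active.append(p)
        active = still_active
    return rewards


def build_group(
    rewards: List[float], group_size: int = 4
) -> Tuple[List[float], List[float]]:
    """Select a fixed-size group that mixes outcomes when possible.

    Advantages are computed against the mean over ALL rewards collected
    for the prompt, not only over the selected group (assumed baseline).
    """
    baseline = sum(rewards) / len(rewards)
    pos = [r for r in rewards if r > 0]
    neg = [r for r in rewards if r <= 0]
    # Aim for a half/half split, then fill from whichever side has spares.
    take_pos = min(len(pos), group_size // 2)
    take_neg = min(len(neg), group_size - take_pos)
    take_pos = min(len(pos), group_size - take_neg)
    group = pos[:take_pos] + neg[:take_neg]
    advantages = [r - baseline for r in group]
    return group, advantages
```

In this sketch, a prompt the model always solves (or always fails) keeps receiving samples until `max_rounds` is exhausted, while a prompt with mixed outcomes stops early, which is one simple way to move the sampling budget toward prompts where a learning signal is still missing.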
