Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Abstract

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient and requires extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and the final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, so that the model consistently trains on tasks that are challenging but solvable. By keeping training within this difficulty range, the adaptive sampling strategy accelerates learning and avoids wasting computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms such as Proximal Policy Optimization (PPO), and it does not modify the reward function or the model architecture. Experiments on competition-level math datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
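
The adaptive sampling idea described above can be illustrated with a minimal Python sketch. The abstract does not state the update rule, so everything specific here is an assumption made for illustration and not the authors' implementation: the class name `AdaptiveCurriculum`, the parameters `step_size`, `sensitivity`, and `goal_reward`, the tanh-shaped shift, and the premise that every training problem comes with a precomputed scalar difficulty score.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveCurriculum:
    """Hypothetical curriculum sampler (illustrative, not the paper's code).

    Tracks a target difficulty and moves it according to recent reward.
    """

    difficulties: list[float]   # assumed: one precomputed difficulty score per problem
    target: float = 0.0         # current target difficulty
    step_size: float = 50.0     # assumed: largest possible shift per update
    sensitivity: float = 2.0    # assumed: how sharply the shift reacts to reward
    goal_reward: float = 0.5    # assumed: reward level treated as "challenging but solvable"

    def sample(self, batch_size: int) -> list[int]:
        """Return indices of the problems whose difficulty is closest to the target."""
        ranked = sorted(
            range(len(self.difficulties)),
            key=lambda i: abs(self.difficulties[i] - self.target),
        )
        return ranked[:batch_size]

    def update(self, mean_reward: float) -> float:
        """Raise the target when recent reward exceeds the goal, lower it otherwise."""
        shift = self.step_size * math.tanh(
            self.sensitivity * (mean_reward - self.goal_reward)
        )
        lowest, highest = min(self.difficulties), max(self.difficulties)
        # Clamp so the target never leaves the range of available problems.
        self.target = min(max(self.target + shift, lowest), highest)
        return self.target
```

In a training loop, one would presumably call `sample` to pick the next batch, run an ordinary RFT step (for example PPO) on those problems, and pass the batch's mean reward to `update`. The reward function and the model are left untouched, which matches the "lightweight extension" framing of the abstract.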
