Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Abstract

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample-inefficient and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and the final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, so that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by keeping training within an optimal difficulty range and avoids wasting computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms such as Proximal Policy Optimization (PPO), without modifying the reward function or the model architecture. Experiments on competition-level math datasets, including AMC, AIME, and IMO-style problems, demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2× and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
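The adaptive sampling idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no update rule, so the tanh-based update, the nearest-difficulty selection, and every name and default value here (`AdaptiveCurriculum`, `step_size`, `sensitivity`, `goal_reward`, the 0 to 100 difficulty scale) are assumptions chosen only to show how a target difficulty might track recent reward.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class AdaptiveCurriculum:
    """Tracks a target difficulty and picks problems near it (illustrative only)."""

    difficulties: List[float]   # hypothetical per-problem difficulty scores
    target: float = 0.0         # current target difficulty
    step_size: float = 10.0     # assumed maximum shift per update
    sensitivity: float = 2.0    # assumed slope of the tanh squashing
    goal_reward: float = 0.5    # assumed "challenging but solvable" reward level

    def select(self, batch_size: int) -> List[int]:
        """Return indices of the problems whose difficulty is closest to the target."""
        order = sorted(
            range(len(self.difficulties)),
            key=lambda i: abs(self.difficulties[i] - self.target),
        )
        return order[:batch_size]

    def update(self, mean_reward: float) -> None:
        """Raise the target after high reward, lower it after low reward."""
        shift = self.step_size * math.tanh(
            self.sensitivity * (mean_reward - self.goal_reward)
        )
        lowest, highest = min(self.difficulties), max(self.difficulties)
        # Keep the target inside the range of difficulties that actually exist.
        self.target = min(highest, max(lowest, self.target + shift))


# Usage: pick a batch, then move the target using the batch's mean reward,
# which in practice would come from the RFT rollout (e.g. a PPO step).
curriculum = AdaptiveCurriculum(difficulties=[0, 20, 40, 60, 80, 100], target=40)
batch_indices = curriculum.select(batch_size=2)
curriculum.update(mean_reward=0.9)  # high reward, so the target moves up
```

Under these assumptions the curriculum sits outside the policy-optimization step: it only decides which problems are sampled, which is consistent with the abstract's statement that the reward function and model architecture are left unchanged.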
