Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

Abstract

Large Language Models (LLMs) can self-improve through reinforcement learning, in which they generate trajectories to explore and discover better solutions. This exploration is computationally expensive, so current methods often assign the same limited exploration budget to every task. Uniform allocation creates problematic edge cases: easy tasks consistently succeed and difficult tasks consistently fail, and both cases produce zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem through the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation lets us derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach reallocates exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional uniform allocation would require about 2x the computational resources.
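
The zero-gradient edge case described above can be illustrated with a minimal sketch. It assumes binary (0/1) rewards, independent rollouts, and a common formulation of GRPO in which each reward is standardized within its rollout group. The function names, the `eps` constant, and the 1% success rate are illustrative assumptions and are not taken from the paper; only the 93-rollout figure echoes the example budget in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style, assumed form).

    If every rollout in the group receives the same reward, every advantage
    is zero, so the group contributes no policy gradient.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_nonzero_gradient(p, n):
    """Probability that n independent 0/1 rollouts are not all identical.

    p is the (assumed) per-rollout success probability for the task.
    """
    return 1.0 - p**n - (1.0 - p) ** n


# An easy task where all rollouts succeed: all advantages are zero.
print(group_relative_advantages([1, 1, 1, 1]))

# A mixed group: advantages are non-zero, so the task yields a gradient.
print(group_relative_advantages([0, 0, 0, 1]))

# A hard task with a hypothetical 1% success rate under two budgets.
for n in (8, 93):
    print(n, round(prob_nonzero_gradient(0.01, n), 3))
```

Under these assumptions, a small uniform budget rarely yields a mixed group on a very hard task, while a much larger budget makes a non-zero gradient likely, which is the motivation for moving rollouts from saturated tasks to difficult ones.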

