Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
September 30, 2025
Authors: Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo
cs.AI
Abstract
Large Language Models (LLMs) can self-improve through reinforcement learning,
where they generate trajectories to explore and discover better solutions.
However, this exploration process is computationally expensive, often forcing
current methods to assign limited exploration budgets to each task. This
uniform allocation creates problematic edge cases: easy tasks consistently
succeed while difficult tasks consistently fail, both producing zero gradients
during training updates for the widely used Group Relative Policy Optimization
(GRPO). We address this problem from the lens of exploration budget allocation.
Viewing each task's exploration as an "item" with a distinct "value" and
"cost", we establish a connection to the classical knapsack problem. This
formulation allows us to derive an optimal assignment rule that adaptively
distributes resources based on the model's current learning status. When
applied to GRPO, our method increases the effective ratio of non-zero policy
gradients by 20-40% during training. Acting as a computational "free lunch",
our approach reallocates exploration budgets from tasks where learning is
saturated to those where it is most impactful. This enables significantly
larger budgets (e.g., 93 rollouts) for especially challenging problems, which
would be computationally prohibitive under a uniform allocation. These
improvements translate to meaningful gains on mathematical reasoning
benchmarks, with average improvements of 2-4 points and peak gains of 9 points
on specific tasks. Notably, achieving comparable performance with the
traditional uniform allocation would require about 2x the computational resources.
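The knapsack framing can be illustrated with a minimal sketch. Purely for illustration, assume a task's "value" is the probability that its GRPO rollout group yields a non-zero gradient (a group where all rollouts succeed, or all fail, has zero group-relative advantages) and its "cost" is one unit per rollout; the `p_hats` success-rate estimates, the greedy solver, and the `min_n` floor below are assumptions for this sketch, not the paper's exact formulation.

```python
def nonzero_grad_prob(p, n):
    # Probability that a group of n rollouts, each succeeding independently
    # with probability p, is mixed (not all-success, not all-fail). Only a
    # mixed group gives GRPO non-zero group-relative advantages.
    return 1.0 - p**n - (1.0 - p)**n

def allocate(p_hats, total_budget, min_n=2):
    # Greedy knapsack-style allocation: give every task the minimum group
    # size, then hand out remaining rollouts one at a time to the task with
    # the largest marginal gain in non-zero-gradient probability.
    # Assumes total_budget >= min_n * len(p_hats).
    n = [min_n] * len(p_hats)
    remaining = total_budget - min_n * len(p_hats)
    for _ in range(remaining):
        gains = [nonzero_grad_prob(p, k + 1) - nonzero_grad_prob(p, k)
                 for p, k in zip(p_hats, n)]
        best = max(range(len(gains)), key=gains.__getitem__)
        n[best] += 1
    return n

# A nearly solved task (p=0.99), a mid-difficulty task (p=0.5), and a very
# hard task (p=0.02) competing for a budget of 24 rollouts.
budgets = allocate([0.99, 0.5, 0.02], total_budget=24)
```

Under this toy model the saturated task is left at the minimum group size, while the hard task absorbs most of the freed budget, mirroring the paper's observation that uniform allocation wastes rollouts on tasks that consistently succeed or consistently fail.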