CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs
February 3, 2026
Authors: Zhiyuan Yao, Yi-Kai Zhang, Yuxin Chen, Yueqing Sun, Zishan Xu, Yu Yang, Tianhao Hu, Qi Gu, Hui Su, Xunliang Cai
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM reasoning. However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, which fail to capture the model's dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources toward samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.
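To make the heap-based greedy allocation concrete, the sketch below shows one way such a scheme could work. It is a minimal illustration, not the paper's implementation: the per-task scores in `task_values` stand in for the Capability-Oriented Value function (whose exact form the abstract does not specify), and the diminishing-returns marginal value and `min_per_task` floor are assumptions introduced here for illustration.

```python
import heapq

def allocate_rollouts(task_values, total_budget, min_per_task=1):
    """Greedily assign a rollout budget across tasks using a max-heap.

    task_values: dict mapping task id -> estimated training value
        (a hypothetical stand-in for a capability-oriented value score).
    total_budget: total number of rollouts to distribute.
    min_per_task: floor so that no task is starved (an assumption here).
    """
    # Start every task at the minimum budget.
    alloc = {t: min_per_task for t in task_values}
    remaining = total_budget - min_per_task * len(task_values)

    # heapq is a min-heap, so negate the marginal value to pop the largest.
    # Assumed marginal value with diminishing returns: value / (count + 1).
    heap = [(-v / (alloc[t] + 1), t) for t, v in task_values.items()]
    heapq.heapify(heap)

    # Hand out one rollout at a time to the task with the highest
    # current marginal value, then push it back with its updated score.
    for _ in range(max(remaining, 0)):
        _, t = heapq.heappop(heap)
        alloc[t] += 1
        heapq.heappush(heap, (-task_values[t] / (alloc[t] + 1), t))
    return alloc

if __name__ == "__main__":
    values = {"task_a": 0.9, "task_b": 0.4, "task_c": 0.1}
    print(allocate_rollouts(values, total_budget=12))
```

In this toy run, higher-value tasks receive proportionally more rollouts while every task keeps at least the minimum budget; the greedy pop-update-push loop costs O(B log N) for budget B over N tasks.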