CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs
February 3, 2026
Authors: Zhiyuan Yao, Yi-Kai Zhang, Yuxin Chen, Yueqing Sun, Zishan Xu, Yu Yang, Tianhao Hu, Qi Gu, Hui Su, Xunliang Cai
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM reasoning. However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, and fail to capture the model's dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources toward samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.
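The abstract does not include implementation details, but the heap-based greedy allocation it describes can be illustrated with a minimal sketch. The snippet below assumes a precomputed capability-oriented value per task and a simple diminishing-returns model for the marginal gain of each additional rollout; the function names, the value/(k+1) gain model, and the per-task floor are all hypothetical choices for illustration, not the paper's actual method.

```python
# Hypothetical sketch of heap-based greedy rollout-budget allocation.
# allocate_rollouts, marginal_gain, and the gain model are illustrative, not from the paper.
import heapq
from typing import Dict, List, Tuple


def allocate_rollouts(
    task_values: Dict[str, float],  # capability-oriented value per task (assumed precomputed)
    total_budget: int,              # total number of rollouts to distribute
    min_per_task: int = 1,          # assumed floor so every task keeps some exploration
) -> Dict[str, int]:
    """Greedily assign rollouts to the tasks with the highest marginal training value."""
    # Start every task at the minimum allocation.
    alloc = {t: min_per_task for t in task_values}
    remaining = total_budget - min_per_task * len(task_values)

    # Assumed diminishing-returns model: marginal gain of the (k+1)-th rollout is value / (k+1).
    def marginal_gain(task: str, current: int) -> float:
        return task_values[task] / (current + 1)

    # Max-heap keyed on negative marginal gain (heapq is a min-heap).
    heap: List[Tuple[float, str]] = [(-marginal_gain(t, alloc[t]), t) for t in task_values]
    heapq.heapify(heap)

    # Hand out the remaining budget one rollout at a time to the current best task.
    for _ in range(max(remaining, 0)):
        _, task = heapq.heappop(heap)
        alloc[task] += 1
        heapq.heappush(heap, (-marginal_gain(task, alloc[task]), task))
    return alloc


# Example: tasks with higher estimated training value receive more of the budget.
print(allocate_rollouts({"easy": 0.1, "frontier": 0.9, "hard": 0.4}, total_budget=12))
```

In this toy setting, the greedy loop concentrates rollouts on tasks whose estimated training value is still high relative to how many rollouts they have already received, which mirrors the exploration-exploitation balance the abstract attributes to CoBA-RL at a very coarse level.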