Optimizing Anytime Reasoning via Budget Relative Policy Optimization
May 19, 2025
Authors: Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
Scaling test-time compute is crucial for enhancing the reasoning capabilities
of large language models (LLMs). Existing approaches typically employ
reinforcement learning (RL) to maximize a verifiable reward obtained at the end
of reasoning traces. However, such methods optimize only the final performance
under a large and fixed token budget, which hinders efficiency in both training
and deployment. In this work, we present a novel framework, AnytimeReasoner, to
optimize anytime reasoning performance, which aims to improve token efficiency
and the flexibility of reasoning under varying token budget constraints. To
achieve this, we truncate the complete thinking process to fit within token
budgets sampled from a prior distribution, compelling the model to summarize
the optimal answer from each truncated thinking process for verification. This introduces
verifiable dense rewards into the reasoning process, facilitating more
effective credit assignment in RL optimization. We then optimize the thinking
and summary policies in a decoupled manner to maximize the cumulative reward.
Additionally, we introduce a novel variance reduction technique, Budget
Relative Policy Optimization (BRPO), to enhance the robustness and efficiency
of the learning process when reinforcing the thinking policy. Empirical results
in mathematical reasoning tasks demonstrate that our method consistently
outperforms GRPO across all thinking budgets under various prior distributions,
enhancing both training and token efficiency.
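
To make the setup concrete, below is a minimal sketch of how budget sampling, truncation, dense per-budget rewards, and a budget-relative baseline could fit together. It is not the paper's implementation: the toy verifier, the budget prior, and the exact form of the baseline are assumptions, and since the abstract does not spell out BRPO's estimator, a GRPO-style group baseline computed separately at each budget stands in to illustrate the shape of the computation.

```python
"""
Minimal, runnable sketch (not the paper's code) of the anytime-reasoning reward
setup described in the abstract. The toy verifier, the budget prior, and the
per-budget group baseline are illustrative assumptions only.
"""

import random


def sample_budgets(prior, k):
    """Sample k thinking budgets from a prior distribution over budgets."""
    budgets, weights = zip(*prior)
    return sorted(random.choices(budgets, weights=weights, k=k))


def toy_verify(truncated_thinking, reference_answer):
    """Stand-in verifier: reward 1.0 if the truncated trace already contains the
    reference answer (characters stand in for tokens here). A real pipeline would
    run the summary policy on the truncated thinking and check its final answer."""
    return 1.0 if reference_answer in truncated_thinking else 0.0


def anytime_rewards(thinking_trace, budgets, reference_answer):
    """Dense verifiable rewards: one reward per sampled budget, obtained by
    truncating the thinking trace at that budget and verifying the summary."""
    return [toy_verify(thinking_trace[:b], reference_answer) for b in budgets]


def per_budget_relative_advantages(group_rewards):
    """Rough stand-in for BRPO's variance reduction: at each budget, subtract the
    group's mean reward at that same budget, i.e. a budget-relative baseline
    analogous to GRPO's group baseline but applied budget-wise."""
    group_size = len(group_rewards)
    num_budgets = len(group_rewards[0])
    baselines = [sum(r[j] for r in group_rewards) / group_size
                 for j in range(num_budgets)]
    return [[r[j] - baselines[j] for j in range(num_budgets)]
            for r in group_rewards]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical prior over thinking budgets and a group of four rollouts.
    prior = [(128, 0.25), (256, 0.25), (512, 0.25), (1024, 0.25)]
    budgets = sample_budgets(prior, k=3)
    reference = "42"
    traces = [
        "reason " * 60 + "42",   # reaches the answer after ~420 characters
        "reason " * 200 + "42",  # reaches it too late for most budgets
        "reason " * 40,          # never reaches it
        "42 is the answer",      # reaches it immediately
    ]
    rewards = [anytime_rewards(t, budgets, reference) for t in traces]
    advantages = per_budget_relative_advantages(rewards)
    print("budgets:   ", budgets)
    print("rewards:   ", rewards)
    print("advantages:", advantages)
```

The sketch only covers the reward and baseline side; in the method described by the abstract, the thinking and summary policies are additionally optimized in a decoupled manner to maximize the cumulative reward over budgets.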