

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

May 19, 2025
作者: Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
cs.AI

Abstract

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large, fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize an optimal answer from each truncated thought process for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results on mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
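
To make the pipeline concrete, here is a minimal Python sketch of the setup the abstract describes: token budgets are sampled from a prior distribution, the thinking trace is truncated at each budget, the answer summarized from each truncation is verified to yield dense rewards, and a budget-relative baseline reduces the variance of the resulting advantages. This is an illustrative reconstruction, not the authors' implementation: all function names (`sample_budgets`, `verify_answer`, etc.) are hypothetical, the verifier is a toy stand-in for running the summary policy and checking its answer, and the exact BRPO baseline (shown here as the mean reward across the sampled budgets of one rollout) is an assumption.

```python
# Minimal sketch of anytime-reasoning rewards with a budget-relative baseline.
# NOT the paper's implementation: names are hypothetical and the verifier is a
# toy proxy. It illustrates (1) sampling budgets from a prior, (2) dense rewards
# from verifying the answer at each truncation, (3) a BRPO-style baseline.

import random
from statistics import mean

def sample_budgets(prior, k):
    """Draw k token budgets from a prior distribution {budget: probability}."""
    return sorted(random.choices(list(prior), weights=list(prior.values()), k=k))

def verify_answer(truncated_thought):
    """Toy verifier standing in for summarizing an answer and checking it.

    In the paper's setting this would run the summary policy on the truncated
    thought and verify the final answer; here we fake it with a length proxy.
    """
    return 1.0 if len(truncated_thought) > 40 else 0.0

def anytime_rewards(thought, budgets):
    """Dense rewards: verify the answer summarized at each budget truncation."""
    return [verify_answer(thought[:b]) for b in budgets]

def brpo_advantages(rewards):
    """BRPO-style variance reduction (assumed form): subtract the mean reward
    across the budgets of the same rollout as a budget-relative baseline."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

if __name__ == "__main__":
    random.seed(0)
    prior = {16: 0.25, 32: 0.25, 64: 0.25, 128: 0.25}  # uniform prior over budgets
    thought = "step " * 40                              # stand-in reasoning trace
    budgets = sample_budgets(prior, k=4)
    rewards = anytime_rewards(thought, budgets)
    advantages = brpo_advantages(rewards)
    for b, r, a in zip(budgets, rewards, advantages):
        print(f"budget={b:4d}  reward={r:.1f}  advantage={a:+.2f}")
```

In the full framework, the thinking and summary policies are then optimized in a decoupled manner to maximize the cumulative reward across budgets; the sketch above covers only the reward and baseline computation.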
