Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Abstract

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically use reinforcement learning (RL) to maximize a verifiable reward obtained at the end of a reasoning trace. Such methods optimize only the final performance under a large, fixed token budget, which hurts efficiency in both training and deployment. In this work, we present AnytimeReasoner, a novel framework that optimizes anytime reasoning performance, with the goal of improving token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, and require the model to summarize its best answer from each truncated thinking trace so that the answer can be verified. This introduces verifiable dense rewards into the reasoning process and enables more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to improve the robustness and efficiency of learning when reinforcing the thinking policy. Empirical results on mathematical reasoning tasks show that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, improving both training efficiency and token efficiency.
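The truncate-and-verify idea described above can be illustrated with a small sketch. This is not the authors' implementation and it does not implement BRPO or any training step. The function names (`truncate`, `anytime_rewards`, `returns_to_go`), the toy `score` function standing in for the summary policy plus verifier, and the choice of suffix sums as a way to spread credit across budgets are all assumptions made for illustration only.

```python
from typing import Callable, Sequence


def truncate(thinking: Sequence[str], budget: int) -> list[str]:
    """Keep only the first `budget` thinking tokens."""
    return list(thinking[:budget])


def anytime_rewards(
    thinking: Sequence[str],
    budgets: Sequence[int],
    prior: Sequence[float],
    score: Callable[[Sequence[str]], float],
) -> list[float]:
    """Score the truncated thinking at each budget, weighted by the budget prior.

    `score` is a hypothetical stand-in for "summarize an answer from the
    truncated thinking, then verify it"; it returns a reward in [0, 1].
    """
    if len(budgets) != len(prior):
        raise ValueError("budgets and prior must have the same length")
    return [p * score(truncate(thinking, b)) for b, p in zip(budgets, prior)]


def returns_to_go(weighted: Sequence[float]) -> list[float]:
    """Suffix sums: the prior-weighted reward still obtainable from each budget onward.

    Assumption: this is one simple way to turn per-budget rewards into a
    dense signal; the paper's exact formulation may differ.
    """
    out: list[float] = []
    total = 0.0
    for w in reversed(weighted):
        total += w
        out.append(total)
    return out[::-1]


# Toy example: the "verifier" accepts any prefix that already contains "42".
thinking = ["let", "x", "be", "42", "so", "done"]
budgets = [2, 4, 6]
prior = [0.25, 0.25, 0.5]


def toy_score(prefix: Sequence[str]) -> float:
    return 1.0 if "42" in prefix else 0.0


weighted = anytime_rewards(thinking, budgets, prior, toy_score)
objective = sum(weighted)          # expected reward under the budget prior
dense = returns_to_go(weighted)    # per-budget credit signal
```

In this toy run the shortest budget earns no reward because the answer has not yet appeared in the truncated trace, while the two longer budgets do, so a trace that reaches the answer earlier would score higher under the same prior.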
