Hierarchical Budget Policy Optimization for Adaptive Reasoning

Abstract

Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet they are computationally inefficient because they apply a uniform reasoning strategy regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long outputs systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing capability degradation. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with problem complexity, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior in which models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently in conflict and can be optimized simultaneously through appropriately structured hierarchical training that preserves exploration diversity.
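To make the idea of budget subgroups concrete, the following is a minimal sketch under stated assumptions, not the HBPO implementation. The abstract only says that rollouts are partitioned into subgroups with distinct token budgets and rewarded in a budget-aware way; the names `Rollout`, `assign_budgets`, and `budget_aware_reward`, the contiguous split, and the linear overshoot penalty are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: int      # length of the generated reasoning trace
    correct: bool    # whether the final answer was right
    budget: int = 0  # token budget of the subgroup this rollout belongs to


def assign_budgets(rollouts, budgets):
    """Split rollouts into contiguous subgroups, one token budget per subgroup.

    Assumption: an even, contiguous split. The abstract does not specify
    how samples are divided among subgroups.
    """
    per_group = -(-len(rollouts) // len(budgets))  # ceiling division
    for i, rollout in enumerate(rollouts):
        rollout.budget = budgets[i // per_group]
    return rollouts


def budget_aware_reward(rollout, penalty=0.5):
    """Hypothetical reward: 1.0 for a correct answer, reduced linearly by the
    relative overshoot beyond the subgroup budget. Incorrect answers get 0.0.
    """
    if not rollout.correct:
        return 0.0
    overshoot = max(0, rollout.tokens - rollout.budget) / rollout.budget
    return max(0.0, 1.0 - penalty * overshoot)


# Four equally long, correct traces for the same problem, two budgets.
rollouts = assign_budgets(
    [Rollout(tokens=1024, correct=True) for _ in range(4)],
    budgets=[512, 2048],
)
rewards = [budget_aware_reward(r) for r in rollouts]
```

In this toy setup the same 1024-token trace is penalized in the 512-token subgroup but receives full reward in the 2048-token subgroup, which illustrates how separate budgets could keep long reasoning paths inside the explored space while still rewarding brevity elsewhere.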

