Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

Abstract

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a failure mode common to many prevalent LLMs, overthinking, in which models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, the Bayesian Budget Allocation Model (BBAM), which models reasoning as a sequence of sub-questions with varying uncertainty, and we introduce the E^3 metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and uses adaptive scheduling to allocate token budgets based on estimated complexity. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to a 70% accuracy gain, a 39% token reduction, and a 187.5% improvement in E^3. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.
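
The budget-allocation idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how complexity is estimated or how the adaptive scheduling works, so everything below is an assumption made for illustration: the function name `allocate_budgets`, the per-sub-question complexity scores, and the simple proportional rule with largest-remainder rounding. It is not the authors' method.

```python
from typing import List


def allocate_budgets(complexities: List[float], total_budget: int) -> List[int]:
    """Split a total token budget across sub-questions.

    Hypothetical helper: each sub-question receives a share proportional
    to its (assumed, externally estimated) complexity score. Rounding uses
    the largest-remainder rule so the shares sum exactly to total_budget.
    """
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if not complexities:
        return []
    if any(c < 0 for c in complexities):
        raise ValueError("complexity scores must be non-negative")

    total = sum(complexities)
    if total == 0:
        # No complexity signal at all: fall back to a uniform split.
        shares = [total_budget / len(complexities)] * len(complexities)
    else:
        shares = [total_budget * c / total for c in complexities]

    # Round every share down, then hand the leftover tokens to the
    # sub-questions with the largest fractional remainders.
    budgets = [int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - budgets[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Three sub-questions; the middle one is assumed to be twice as hard.
    print(allocate_budgets([1.0, 2.0, 1.0], 400))
```

Compared with a single fixed budget for the whole query, a split of this kind gives harder sub-questions more room and easier ones less, which is the intuition behind avoiding both overthinking and underthinking.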
