Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning
May 22, 2025
Authors: Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs: overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we find that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E^3 metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, a 39% token reduction, and a +187.5% improvement in E^3. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.
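The abstract describes Plan-and-Budget only at a high level. Below is a minimal Python sketch of the budget-allocation idea it outlines: split a global token budget across sub-questions in proportion to estimated complexity, damped by a decaying schedule. The decomposition, the complexity scores, the `decay` schedule, and the `e3` formula here are all illustrative assumptions, not the paper's actual implementation; in practice the plan and complexity estimates would come from an LLM.

```python
# Hypothetical sketch of the Plan-and-Budget idea from the abstract.
# All names, scores, and formulas below are assumptions for illustration.

from dataclasses import dataclass


@dataclass
class SubQuestion:
    text: str
    complexity: float  # assumed difficulty rating in (0, 1], e.g. from an LLM


def allocate_budgets(subqs, total_budget, decay=0.8):
    """Split `total_budget` tokens across sub-questions.

    Weight i = complexity_i * decay**i: harder steps get more tokens,
    and earlier steps (where uncertainty is assumed higher) are favored.
    """
    weights = [sq.complexity * decay**i for i, sq in enumerate(subqs)]
    norm = sum(weights)
    return [max(1, round(total_budget * w / norm)) for w in weights]


def e3(accuracy, tokens_used):
    """One plausible reading of an E^3-style score: reward correctness,
    penalize token usage (higher is better). The paper's exact
    definition may differ."""
    return accuracy**2 / tokens_used


if __name__ == "__main__":
    plan = [
        SubQuestion("Restate the problem and extract the givens", 0.3),
        SubQuestion("Set up the governing equation", 0.9),
        SubQuestion("Solve and sanity-check the result", 0.6),
    ]
    budgets = allocate_budgets(plan, total_budget=1024)
    for sq, b in zip(plan, budgets):
        print(f"{b:4d} tokens -> {sq.text}")
    # Toy comparison: same accuracy with fewer tokens yields a higher score.
    print("E^3 at 1024 tokens:", e3(1.0, sum(budgets)))
```

Under this reading, a fixed uniform budget would over-provision the easy first step and starve the hard middle one; weighting by complexity with a decay schedule is one simple way to realize the adaptive allocation the abstract attributes to the framework.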