Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning
May 22, 2025
作者: Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in complex
reasoning tasks, but their inference remains computationally inefficient. We
observe a common failure mode in many prevalent LLMs: overthinking, where
models generate verbose and tangential reasoning traces even for simple
queries. Recent works have tried to mitigate this by enforcing fixed token
budgets; however, this can lead to underthinking, especially on harder
problems. Through empirical analysis, we identify that this inefficiency often
stems from unclear problem-solving strategies. To formalize this, we develop a
theoretical model, BBAM (Bayesian Budget Allocation Model), which models
reasoning as a sequence of sub-questions with varying uncertainty, and
introduce the E^3 metric to capture the trade-off between correctness and
computational efficiency. Building on theoretical results from BBAM, we propose
Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex
queries into sub-questions and, via adaptive scheduling, allocates token
budgets according to each sub-question's estimated complexity.
Plan-and-Budget improves reasoning
efficiency across a range of tasks and models, achieving up to +70% accuracy
gains, a 39% token reduction, and a +187.5% improvement in E^3. Notably, it
elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger
model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close
performance gaps without retraining. Our code is available at
anonymous.4open.science/r/P-and-B-6513/.
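
For intuition, here is a minimal Python sketch of the two ideas named in the abstract: an E^3-style score that trades correctness against token usage, and a complexity-proportional token allocator standing in for Plan-and-Budget's adaptive scheduling. The function names (`e3_score`, `allocate_budgets`), the exponent `alpha`, and the proportional-split rule are illustrative assumptions; the abstract does not specify the paper's exact formulas.

```python
# Illustrative sketch only: the exact E^3 definition and the Plan-and-Budget
# scheduler are not given in the abstract, so every formula and name here
# (e3_score, allocate_budgets, alpha) is a hypothetical stand-in.

def e3_score(accuracy: float, tokens: int, alpha: float = 2.0) -> float:
    """Toy correctness/compute trade-off: higher accuracy raises the score,
    more tokens lower it. One plausible shape; the paper's E^3 may differ."""
    return (accuracy ** alpha) / tokens

def allocate_budgets(complexities: list[float], total_budget: int) -> list[int]:
    """Split a total token budget across sub-questions in proportion to their
    estimated complexity (a stand-in for adaptive scheduling)."""
    total = sum(complexities)
    return [max(1, round(total_budget * c / total)) for c in complexities]

if __name__ == "__main__":
    # Three sub-questions of increasing estimated complexity share 1,000 tokens.
    print(allocate_budgets([0.2, 0.3, 0.5], total_budget=1000))  # [200, 300, 500]
    # Equal accuracy with fewer tokens scores higher under the toy metric.
    print(e3_score(0.8, tokens=900) > e3_score(0.8, tokens=1200))  # True
```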