Scaling Test-time Compute for LLM Agents
June 15, 2025
Authors: King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou
cs.AI
Abstract
Scaling test-time compute has shown remarkable success in improving the
reasoning abilities of large language models (LLMs). In this work, we conduct
the first systematic exploration of applying test-time scaling methods to
language agents and investigate the extent to which doing so improves their
effectiveness. Specifically, we explore different test-time scaling strategies,
including: (1) parallel sampling algorithms; (2) sequential revision
strategies; (3) verifiers and merging methods; (4) strategies for diversifying
rollouts. We carefully analyze and ablate the impact of different design
choices when applying test-time scaling to language agents, and arrive at the
following findings: 1. Scaling test-time compute can improve agent performance.
2. Knowing when to reflect is important for agents. 3. Among the verification
and result-merging approaches considered, the list-wise method performs best.
4. Increasing the diversity of rollouts has a positive effect on the agent's
task performance.
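
To make the combination of strategies (1), (3), and (4) concrete, the following is a minimal Python sketch of parallel test-time scaling with list-wise merging. The helper names (`run_agent`, `list_wise_select`), their toy bodies, and the temperature schedule are illustrative assumptions rather than the paper's implementation; in a real system each helper would wrap LLM calls.

```python
import random
from typing import List

def run_agent(task: str, temperature: float) -> str:
    """Toy stand-in for one agent rollout; a real version would drive an
    LLM agent loop and return its final answer."""
    return f"answer to {task!r} (sampled at T={temperature:.1f})"

def list_wise_select(task: str, candidates: List[str]) -> int:
    """Toy stand-in for a list-wise verifier: it sees ALL candidates at
    once and returns the index of the best one, unlike point-wise scoring
    or pair-wise comparison. Here we pick at random for illustration."""
    return random.randrange(len(candidates))

def scale_test_time(task: str, n_rollouts: int = 8) -> str:
    # (1) Parallel sampling: draw several independent rollouts.
    # (4) Diversified rollouts: vary the temperature so trajectories differ.
    temperatures = [0.3 + 0.1 * i for i in range(n_rollouts)]
    candidates = [run_agent(task, t) for t in temperatures]
    # (3) List-wise verification and merging: rank all candidates in a
    # single joint pass, then return the winner.
    return candidates[list_wise_select(task, candidates)]

if __name__ == "__main__":
    print(scale_test_time("book a flight from NYC to SFO"))
```

The list-wise step is the part the abstract highlights: the verifier compares every candidate jointly in one ranking pass, rather than scoring candidates in isolation or merging them pair-wise.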