Scaling Test-time Compute for LLM Agents
June 15, 2025
Authors: King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou
cs.AI
Abstract
Scaling test-time compute has shown remarkable success in improving the
reasoning abilities of large language models (LLMs). In this work, we conduct
the first systematic exploration of applying test-time scaling methods to
language agents and investigate the extent to which doing so improves their
effectiveness. Specifically, we explore different test-time scaling strategies,
including: (1) parallel sampling algorithms; (2) sequential revision
strategies; (3) verifiers and merging methods; (4) strategies for diversifying
rollouts. We carefully analyze and ablate the impact of these design
strategies on language agents, and arrive at the following findings:
1. Scaling test-time compute improves agent performance. 2. Knowing when to
reflect is important for agents. 3. Among different verification and result
merging approaches, the list-wise method performs best. 4. Increasing the
diversity of rollouts has a positive effect on the agent's task performance.
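To make the strategy names concrete, here is a minimal Python sketch combining two of the approaches the abstract describes: parallel sampling (best-of-n) with diversified rollouts, followed by list-wise verification. All names (`agent_rollout`, `listwise_rank`, `best_of_n`) and the dummy ranking heuristic are illustrative assumptions, not APIs or methods from the paper.

```python
import random

# Hypothetical stand-ins for real LLM calls: in practice, agent_rollout would
# run a full agent loop (plan -> act -> observe) and listwise_rank would
# prompt a verifier model with all candidate trajectories at once.

def agent_rollout(task: str, temperature: float) -> str:
    """Run one agent trajectory; higher temperature yields more diverse rollouts."""
    return f"answer to {task!r} (t={temperature:.2f}, seed={random.random():.3f})"

def listwise_rank(task: str, candidates: list[str]) -> list[int]:
    """Return candidate indices ordered best-first.

    A list-wise verifier sees all candidates jointly, unlike point-wise
    (score one at a time) or pair-wise (compare two at a time) verifiers.
    Sorting by length here is a placeholder heuristic only.
    """
    return sorted(range(len(candidates)), key=lambda i: len(candidates[i]))

def best_of_n(task: str, n: int = 8, temperatures=(0.4, 0.7, 1.0)) -> str:
    """Parallel sampling: spend more test-time compute by drawing n rollouts
    at varied temperatures, then keep the verifier's top-ranked candidate."""
    candidates = [agent_rollout(task, temperatures[i % len(temperatures)])
                  for i in range(n)]
    ranking = listwise_rank(task, candidates)
    return candidates[ranking[0]]

if __name__ == "__main__":
    print(best_of_n("book the cheapest flight from LHR to JFK"))
```

In this sketch, increasing `n` is the knob that scales test-time compute, and cycling through several temperatures is one simple way to diversify rollouts before the verifier merges them into a single answer.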