Benchmark Test-Time Scaling of General LLM Agents
February 22, 2026
Authors: Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong
cs.AI
Abstract
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across the search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behavior under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
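The two scaling modes the abstract contrasts can be illustrated with a minimal sketch. The functions below are hypothetical stand-ins for the paper's actual evaluation harness: `run_step`, `sample_trajectory`, and `verify` are assumed interfaces, not the benchmark's API. Sequential scaling extends a single trajectory under a step budget (a rough proxy for the context ceiling), while parallel scaling draws independent trajectories and keeps the one a verifier scores highest (so its effectiveness hinges on verifier quality, i.e. the verification gap).

```python
def sequential_scaling(task, run_step, max_steps):
    """Iterative interaction: extend one trajectory until the agent
    signals completion or the step budget is exhausted."""
    trajectory = []
    for _ in range(max_steps):
        action, done = run_step(task, trajectory)
        trajectory.append(action)
        if done:
            break
    return trajectory


def parallel_scaling(task, sample_trajectory, verify, n_samples):
    """Sample several independent trajectories and return the one
    the verifier scores highest."""
    candidates = [sample_trajectory(task) for _ in range(n_samples)]
    return max(candidates, key=verify)
```

Under this framing, sequential scaling spends compute on depth (more turns per attempt) and parallel scaling on breadth (more attempts per task); the paper's finding is that, in the general-agent setting, neither axis translates into reliable gains.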