What Limits Agentic Systems Efficiency?
October 18, 2025
Authors: Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman
cs.AI
Abstract
Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have
demonstrated strong reasoning capabilities. To further enhance LLM
capabilities, recent agentic systems, such as Deep Research, incorporate web
interactions into LLM reasoning to mitigate uncertainties and reduce potential
errors. However, existing research predominantly focuses on reasoning
performance, often neglecting the efficiency of agentic systems. In this work,
we present a comprehensive empirical study that identifies efficiency
bottlenecks in web-interactive agentic systems. We decompose end-to-end latency
into two primary components: LLM API latency and web environment latency. Our
study spans 15 models from 5 providers and reveals high variability in
API-based agentic systems. We observe that web environment latency can
contribute as much as 53.7% of the overall latency in a web-based agentic
system. To reduce latency, we propose SpecCache, a caching
framework augmented with speculative execution that can reduce web environment
overhead. Extensive evaluations on two standard benchmarks show that our
approach improves the cache hit rate by up to 58x compared to a random caching
strategy, while reducing web environment overhead by up to 3.2x, without
degrading agentic system performance.
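
The core idea behind SpecCache, combining a cache with speculative execution, can be illustrated with a minimal sketch. This is not the paper's implementation: the `fetch` and `predict` interfaces, and the idea of running a cheap draft predictor to prefetch likely web actions while the target LLM is still generating, are assumptions made for illustration.

```python
import threading


class SpecCacheSketch:
    """Minimal sketch of a speculative-execution cache for web actions.

    While the main agent waits on a slow LLM API call, a cheap draft
    predictor guesses the agent's likely next web actions and prefetches
    their results, so a cache hit skips the web environment round trip.
    """

    def __init__(self, fetch, predict):
        self.fetch = fetch      # slow web-environment call: action -> result
        self.predict = predict  # cheap draft predictor: context -> [actions]
        self.cache = {}
        self.lock = threading.Lock()

    def speculate(self, context):
        # Intended to run in the background while the target model thinks.
        for action in self.predict(context):
            with self.lock:
                cached = action in self.cache
            if not cached:
                result = self.fetch(action)  # prefetch outside the lock
                with self.lock:
                    self.cache[action] = result

    def execute(self, action):
        with self.lock:
            if action in self.cache:         # hit: no web latency paid
                return self.cache.pop(action)
        return self.fetch(action)            # miss: pay full web latency
```

A toy usage: with `fetch = lambda a: f"page:{a}"` and a predictor that guesses `["search q"]`, calling `speculate(ctx)` then `execute("search q")` returns the prefetched result without a second fetch, while an unpredicted action falls through to the web environment.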