

What Limits Agentic Systems Efficiency?

October 18, 2025
Authors: Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman
cs.AI

Abstract

Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated strong reasoning capabilities. To further enhance LLM capabilities, recent agentic systems, such as Deep Research, incorporate web interactions into LLM reasoning to mitigate uncertainties and reduce potential errors. However, existing research predominantly focuses on reasoning performance, often neglecting the efficiency of agentic systems. In this work, we present a comprehensive empirical study that identifies efficiency bottlenecks in web-interactive agentic systems. We decompose end-to-end latency into two primary components: LLM API latency and web environment latency. We conduct a comprehensive empirical study across 15 models and 5 providers to demonstrate high variability in API-based agentic systems. We observe that web environment latency can contribute as much as 53.7% to the overall latency in a web-based agentic system. To improve latency, we propose SpecCache, a caching framework augmented with speculative execution that can reduce web environment overhead. Extensive evaluations on two standard benchmarks show that our approach improves the cache hit rate by up to 58x compared to a random caching strategy, while reducing web environment overhead by up to 3.2x, without degrading agentic system performance.
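The abstract describes SpecCache as a cache augmented with speculative execution: while the main LLM is still reasoning about its next web action, a cheaper draft predictor proposes likely actions whose results are prefetched in the background, so that a correct guess turns a slow web-environment call into a cache hit. Below is a minimal, hypothetical sketch of that idea (not the paper's implementation); the `fetch` function and `SpecCacheSketch` class are illustrative names invented here, and a real system would use an actual browser or HTTP client plus a draft model to generate candidate actions.

```python
import concurrent.futures


def fetch(url: str) -> str:
    # Stand-in for a real web-environment call (network fetch, browser step).
    return f"content of {url}"


class SpecCacheSketch:
    """Toy illustration of caching plus speculative execution:
    draft-model guesses are prefetched concurrently with LLM reasoning,
    so a matching guess avoids paying web-environment latency."""

    def __init__(self, fetch_fn=fetch, workers: int = 4):
        self.cache: dict[str, str] = {}
        self.pending: dict[str, concurrent.futures.Future] = {}
        self.fetch_fn = fetch_fn
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)

    def speculate(self, candidate_urls: list[str]) -> None:
        # Kick off background prefetches for the draft model's predicted actions.
        for url in candidate_urls:
            if url not in self.cache and url not in self.pending:
                self.pending[url] = self.pool.submit(self.fetch_fn, url)

    def execute(self, url: str) -> tuple[str, bool]:
        # The main model's chosen action: serve from the cache (or a completed
        # prefetch) on a hit; otherwise pay the full web-environment latency.
        if url in self.cache:
            return self.cache[url], True
        if url in self.pending:
            result = self.pending.pop(url).result()
            self.cache[url] = result
            return result, True
        result = self.fetch_fn(url)
        self.cache[url] = result
        return result, False


cache = SpecCacheSketch()
cache.speculate(["https://example.com/a", "https://example.com/b"])
content_a, hit_a = cache.execute("https://example.com/a")   # speculated: hit
content_c, hit_c = cache.execute("https://example.com/c")   # not speculated: miss
```

The key design point the abstract highlights is that speculation quality matters: a random caching strategy rarely prefetches the action the agent actually takes, which is why SpecCache's reported cache hit rate is up to 58x higher than a random policy's.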
PDF · October 21, 2025