Autellix: An Efficient Serving Engine for LLM Agents as General Programs

February 19, 2025
Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program level. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms, for single-threaded and distributed programs respectively, that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves the throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.
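
The abstract does not spell out the scheduling algorithms themselves, but prioritizing a new LLM call by its program's previously completed work is in the spirit of least-attained-service scheduling. Below is a minimal, hypothetical Python sketch of that idea; all names (`ProgramAwareScheduler`, `PendingCall`, etc.) are invented for illustration and are not Autellix's actual API or algorithm.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingCall:
    # Heap ordering uses (priority, seq); lower priority value runs first.
    priority: float                        # cumulative service of the owning program
    seq: int                               # monotonic tie-breaker for stable ordering
    program_id: str = field(compare=False)
    prompt: str = field(compare=False)


class ProgramAwareScheduler:
    """Illustrative least-attained-service scheduling at program granularity:
    the next LLM call to run belongs to the program that has so far consumed
    the least total service time across its completed calls."""

    def __init__(self) -> None:
        self.queue: list[PendingCall] = []
        self.service: dict[str, float] = {}   # program_id -> cumulative service
        self._seq = itertools.count()

    def submit(self, program_id: str, prompt: str) -> None:
        # A new call inherits a priority equal to its program's accumulated
        # service, so programs that have run little are scheduled first.
        prio = self.service.get(program_id, 0.0)
        heapq.heappush(self.queue, PendingCall(prio, next(self._seq), program_id, prompt))

    def next_call(self) -> PendingCall | None:
        # Pop the call whose program has attained the least service so far.
        return heapq.heappop(self.queue) if self.queue else None

    def record_completion(self, program_id: str, service_time: float) -> None:
        # Charge the finished call's service back to its program, lowering
        # the priority of that program's future calls.
        self.service[program_id] = self.service.get(program_id, 0.0) + service_time
```

In this sketch, a fresh program's first call starts at priority 0 and runs ahead of calls from programs that have already consumed substantial service, which loosely mirrors how program-level context can mitigate the head-of-line blocking described in the abstract; preemption of in-flight calls, which the paper also proposes, is omitted here.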
