KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
October 14, 2025
Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
cs.AI
Abstract
Multi-agent large language model (LLM) systems are increasingly adopted for
complex language processing tasks that require communication and coordination
among agents. However, these systems often suffer substantial overhead from
repeated reprocessing of overlapping contexts across agents. In typical
pipelines, once an agent receives a message from its predecessor, the full
context, including prior turns, must be reprocessed from scratch, leading to
inefficient processing. While key-value (KV) caching is an effective solution
for avoiding redundant computation in single-agent settings where prefixes
remain unchanged, it cannot be directly reused in multi-agent scenarios due to
diverging prefixes introduced by agent-specific context extensions. We identify
that the core challenge lies in the offset variance of KV-caches across agents.
To address this, we propose KVCOMM, a training-free framework that enables
efficient prefilling in multi-agent inference by reusing KV-caches and aligning
cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM
estimates and adjusts KV-caches for shared content by referencing a pool of
cached examples, termed anchors, that store observed cache deviations under
varying prefixes. The anchor pool is maintained and updated online, allowing
dynamic adaptation to distinct user requests and context structures. KVCOMM
achieves over 70% reuse rate across diverse multi-agent workloads, including
retrieval-augmented generation, math reasoning, and collaborative coding tasks,
all without quality degradation. In particular, when each fully-connected agent
receives 1K input tokens with 512 prefix tokens and 512 output tokens under a
five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard
prefill pipeline, reducing time-to-first-token (TTFT) from ~430 ms to ~55 ms.
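
To make the anchor mechanism concrete, below is a minimal Python sketch of the idea the abstract describes: an online pool of anchors, each pairing a prefix signature with the KV-cache deviation observed under that prefix, used to adjust a reusable cache instead of re-prefilling. All names here (Anchor, AnchorPool, estimate_kv), the cosine-similarity lookup, the additive deviation model, and the FIFO eviction are illustrative assumptions; the abstract does not specify KVCOMM's actual estimation or pool-maintenance policies.

```python
# Illustrative sketch only: names and the additive deviation model are
# assumptions, not KVCOMM's actual implementation.
from dataclasses import dataclass

import numpy as np


@dataclass
class Anchor:
    """One cached example: a prefix signature paired with the KV-cache
    deviation observed when shared content was prefilled under that prefix."""
    prefix_embedding: np.ndarray  # signature of the prefix context, shape (D,)
    kv_deviation: np.ndarray      # observed cache offset for the shared span


class AnchorPool:
    """Online pool of anchors, queried by prefix similarity."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.anchors: list[Anchor] = []

    def nearest(self, prefix_embedding: np.ndarray) -> Anchor | None:
        """Return the anchor whose prefix is most similar (cosine) to the query."""
        if not self.anchors:
            return None
        q = prefix_embedding / (np.linalg.norm(prefix_embedding) + 1e-8)
        sims = [
            float(q @ (a.prefix_embedding
                       / (np.linalg.norm(a.prefix_embedding) + 1e-8)))
            for a in self.anchors
        ]
        return self.anchors[int(np.argmax(sims))]

    def update(self, anchor: Anchor) -> None:
        """Insert a newly observed deviation; FIFO eviction bounds the pool."""
        if len(self.anchors) >= self.capacity:
            self.anchors.pop(0)
        self.anchors.append(anchor)


def estimate_kv(base_kv: np.ndarray,
                prefix_embedding: np.ndarray,
                pool: AnchorPool) -> tuple[np.ndarray, bool]:
    """Adjust a reusable KV-cache for a new prefix context.

    The deviation stored in the closest anchor is added to the base cache;
    an empty pool signals a fallback to standard prefill (reused=False).
    """
    anchor = pool.nearest(prefix_embedding)
    if anchor is None:
        return base_kv, False  # no reference yet: recompute from scratch
    return base_kv + anchor.kv_deviation, True
```

In this sketch, an empty pool falls back to standard prefill, mirroring the abstract's point that anchors are accumulated and updated online, so the system adapts as new user requests and context structures are observed.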