KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
October 14, 2025
Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
cs.AI
Abstract
Multi-agent large language model (LLM) systems are increasingly adopted for
complex language processing tasks that require communication and coordination
among agents. However, these systems often suffer substantial overhead from
repeated reprocessing of overlapping contexts across agents. In typical
pipelines, once an agent receives a message from its predecessor, the full
context, including prior turns, must be reprocessed from scratch, leading to
inefficient processing. While key-value (KV) caching is an effective solution
for avoiding redundant computation in single-agent settings where prefixes
remain unchanged, it cannot be directly reused in multi-agent scenarios due to
diverging prefixes introduced by agent-specific context extensions. We identify
that the core challenge lies in the offset variance of KV-caches across agents.
To address this, we propose KVCOMM, a training-free framework that enables
efficient prefilling in multi-agent inference by reusing KV-caches and aligning
cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM
estimates and adjusts KV-caches for shared content by referencing a pool of
cached examples, termed anchors, that store observed cache deviations under
varying prefixes. The anchor pool is maintained and updated online, allowing
dynamic adaptation to distinct user requests and context structures. KVCOMM
achieves over 70% reuse rate across diverse multi-agent workloads, including
retrieval-augmented generation, math reasoning, and collaborative coding tasks,
all without quality degradation. In particular, in a five-agent setting where
each fully connected agent receives 1K input tokens (512 prefix tokens and 512
output tokens), KVCOMM achieves up to a 7.8x speedup over the standard prefill
pipeline, reducing time-to-first-token (TTFT) from ~430 ms to ~55 ms.
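
To make the anchor mechanism concrete, the sketch below shows one way such a pool could work. This is an illustrative reading of the abstract, not the paper's implementation: the names (`AnchorPool`, `estimate_kv`), the nearest-neighbor lookup, the FIFO eviction, and the additive-deviation model are all assumptions made for exposition.

```python
import numpy as np

class AnchorPool:
    """Illustrative sketch of an anchor pool (assumed design, not the paper's
    code): each anchor pairs a prefix descriptor with the KV-cache deviation
    observed for shared content under that prefix."""

    def __init__(self, max_size: int = 64):
        self.anchors: list[tuple[np.ndarray, np.ndarray]] = []
        self.max_size = max_size

    def add(self, prefix_feat: np.ndarray, kv_deviation: np.ndarray) -> None:
        """Record a (prefix descriptor, observed KV deviation) pair online."""
        if len(self.anchors) >= self.max_size:
            self.anchors.pop(0)  # simple FIFO eviction, assumed for the sketch
        self.anchors.append((prefix_feat, kv_deviation))

    def estimate_kv(self, base_kv: np.ndarray,
                    prefix_feat: np.ndarray) -> np.ndarray | None:
        """Estimate the KV-cache of shared content under a new prefix by
        shifting a base cache with the nearest anchor's deviation."""
        if not self.anchors:
            return None  # no reference yet: fall back to standard prefill
        _feat, dev = min(self.anchors,
                         key=lambda a: np.linalg.norm(a[0] - prefix_feat))
        return base_kv + dev

# Toy usage with random tensors (shapes are placeholders): after one real
# prefill under prefix A, record the deviation; for a later agent whose
# prefix B differs, estimate the KV-cache instead of re-prefilling.
rng = np.random.default_rng(0)
base_kv = rng.standard_normal((4, 16))       # base KV-cache of shared content
kv_under_a = base_kv + 0.1 * rng.standard_normal((4, 16))
prefix_feat_a = rng.standard_normal(8)       # descriptor of prefix A
prefix_feat_b = prefix_feat_a + 0.01         # prefix B, close to A

pool = AnchorPool()
pool.add(prefix_feat_a, kv_under_a - base_kv)
approx_kv = pool.estimate_kv(base_kv, prefix_feat_b)
```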