KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Abstract

Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeatedly reprocessing contexts that overlap across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch. Key-value (KV) caching avoids this redundant computation in single-agent settings where prefixes remain unchanged, but the caches cannot be directly reused in multi-agent scenarios because agent-specific context extensions introduce diverging prefixes. We identify the core challenge as the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning the cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves a reuse rate of over 70% across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. In particular, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing time to first token (TTFT) from ~430 ms to ~55 ms.
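The anchor mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how anchors are matched or how deviations are combined, so the distance threshold `max_distance`, the exponential weighting, the class names `Anchor` and `AnchorPool`, and the use of short float vectors in place of real KV tensors are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Anchor:
    base_kv: Vector  # toy stand-in for the shared content's KV computed without the new prefix
    offset: Vector   # observed deviation: KV under a prefix minus base_kv


@dataclass
class AnchorPool:
    max_distance: float  # hypothetical threshold for "an anchor is close enough"
    anchors: List[Anchor] = field(default_factory=list)

    def estimate(self, base_kv: Vector) -> Optional[Vector]:
        """Approximate the KV under the new prefix, or return None on a miss."""
        near = []
        for anchor in self.anchors:
            d = distance(anchor.base_kv, base_kv)
            if d <= self.max_distance:
                # Assumed weighting: closer anchors contribute more.
                near.append((math.exp(-d), anchor))
        if not near:
            return None  # caller falls back to a full prefill
        total = sum(w for w, _ in near)
        offset = [
            sum(w * a.offset[i] for w, a in near) / total
            for i in range(len(base_kv))
        ]
        return [b + o for b, o in zip(base_kv, offset)]

    def update(self, base_kv: Vector, true_kv: Vector) -> None:
        """Online update: record the deviation observed after a full prefill."""
        offset = [t - b for t, b in zip(true_kv, base_kv)]
        self.anchors.append(Anchor(list(base_kv), offset))


pool = AnchorPool(max_distance=1.0)
print(pool.estimate([1.0, 2.0]))     # miss: the pool is still empty
pool.update([1.0, 2.0], [1.5, 2.5])  # store the deviation seen after a full prefill
print(pool.estimate([1.0, 2.0]))     # hit: base KV plus the stored offset
```

In this sketch a miss triggers the standard prefill and then feeds the pool through `update`, which mirrors the online maintenance the abstract describes; a hit skips the prefill and reuses the cache with an estimated offset.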
