KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Abstract

Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeatedly reprocessing overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective way to avoid redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios because agent-specific context extensions cause the prefixes to diverge. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves a reuse rate of over 70% across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. In particular, in a five-agent setting where each fully connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens, KVCOMM achieves up to a 7.8x speedup over the standard prefill pipeline, reducing time to first token (TTFT) from ~430 ms to ~55 ms.
