LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Abstract

Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only in their lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from the LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form. With shared-A multi-LoRA architectures, it further reduces compute overhead by also sharing the low-rank cache, which avoids redundant computation for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders the attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.
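The cache decomposition and the reordered attention computation described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation or the Flash-LoRA-Attention kernel: it covers a single layer and a single head, omits softmax scaling, positional encodings, and the value path, and assumes that every agent sees exactly the same hidden states, whereas the abstract only states that backbone activations remain highly similar. All names and sizes (x, w_k, a, b_agents, the toy dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy sizes, chosen only for illustration.
n_tokens, d_model, d_head, rank, n_agents = 6, 16, 8, 2, 3

x = rng.standard_normal((n_tokens, d_model))   # context hidden states (assumed identical for all agents)
w_k = rng.standard_normal((d_model, d_head))   # pretrained key projection, shared by all agents
a = rng.standard_normal((d_model, rank))       # LoRA down-projection, shared across agents (shared-A)
b_agents = [rng.standard_normal((rank, d_head)) for _ in range(n_agents)]  # per-agent up-projections

# Computed and stored once for all agents.
base_k = x @ w_k       # shared base component, shape (n_tokens, d_head)
low_rank_k = x @ a     # shared low-rank component, shape (n_tokens, rank)


def scores_materialized(q, b):
    """Attention logits after expanding the low-rank cache to full dimension."""
    full_k = base_k + low_rank_k @ b           # (n_tokens, d_head), what a per-agent cache would hold
    return q @ full_k.T


def scores_reordered(q, b):
    """Same logits, but the query is projected into rank space instead of expanding the cache."""
    q_low = q @ b.T                            # (rank,)
    return q @ base_k.T + q_low @ low_rank_k.T


def cache_elements_non_shared():
    """Key-cache entries when each agent stores its own full-dimension keys."""
    return n_agents * n_tokens * d_head


def cache_elements_shared():
    """Key-cache entries when the base and low-rank components are stored once."""
    return n_tokens * d_head + n_tokens * rank


q = rng.standard_normal(d_head)                # one decoding-step query
for b in b_agents:
    assert np.allclose(scores_materialized(q, b), scores_reordered(q, b))
```

The two score functions agree because q (X A B)^T = (q B^T)(X A)^T, so the per-agent matrix B only ever multiplies the query and the cached X A is never expanded to the head dimension. Under these toy assumptions the shared layout stores one base cache plus one rank-sized cache rather than one full cache per agent.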
