LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
February 1, 2026
Authors: Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim
cs.AI
Abstract
Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing the base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by the adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component produced by the pretrained weights and an adapter-dependent component produced by the LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form; in shared-A multi-LoRA architectures, it further reduces compute overhead by also sharing the low-rank cache and avoiding redundant computation for contexts already processed by other agents. To reconstruct adapter contributions efficiently at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders the attention computation to avoid materializing the low-rank cache at full dimension. Across agentic question-answering benchmarks, LRAgent achieves throughput and time-to-first-token latency close to fully shared caching while preserving accuracy near the non-shared caching baseline.
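
For intuition, here is a minimal sketch of the cache decomposition described above, written in standard LoRA notation; the symbols (context activations X, base key projection W_K, low-rank factors A_K and B_K^(i), rank r, agent index i) are our own illustration, not notation taken from the paper. For agent i, the key cache can be written as

    K^(i) = X (W_K + A_K B_K^(i)) = X W_K + (X A_K) B_K^(i),

where X W_K is the base component shared by all agents and X A_K is the rank-r low-rank cache, which a shared-A architecture also allows to be shared; only the small per-agent factor B_K^(i) differs. One natural way to reorder the attention-score computation under these assumptions, consistent with the role the abstract ascribes to Flash-LoRA-Attention, is

    Q K^(i)^T = Q (X W_K)^T + (Q B_K^(i)^T) (X A_K)^T,

so the rank-r cache X A_K is consumed directly and never expanded to the full head dimension; the value side can be handled analogously by applying B_V^(i) after the probability-weighted reduction over X A_V.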