Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving
June 15, 2026
Authors: Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi
cs.AI
Abstract
Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to 1.7× or burn 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
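To make the three mechanisms named in the abstract more concrete, the following is a minimal illustrative sketch, not the Tangram implementation. All function names, the page size of 16 tokens, the rule "budget = ceil(ratio × sequence length)", the equal-size grouping of heads, and the greedy partitioning heuristic are assumptions chosen for illustration; the abstract does not specify how Tangram performs any of these steps.

```python
import math
from typing import List

PAGE_SIZE = 16  # tokens per KV page (assumed value, not taken from the paper)


def reserve_budgets(ratios: List[float], seq_len: int) -> List[int]:
    """Fix each head's post-compression token budget up front.

    Assumes per-head retention ratios were calibrated offline and that
    the budget is simply ceil(ratio * seq_len).
    """
    return [math.ceil(r * seq_len) for r in ratios]


def pages_uniform(budgets: List[int], page_size: int = PAGE_SIZE) -> int:
    """Pages used when every head is padded to the longest head's length."""
    return len(budgets) * math.ceil(max(budgets) / page_size)


def group_heads(budgets: List[int], num_groups: int) -> List[List[int]]:
    """Sort heads by budget and cut them into contiguous, equal-size groups."""
    order = sorted(range(len(budgets)), key=lambda h: budgets[h])
    size = math.ceil(len(order) / num_groups)
    return [order[i:i + size] for i in range(0, len(order), size)]


def pages_ragged(budgets: List[int], groups: List[List[int]],
                 page_size: int = PAGE_SIZE) -> int:
    """Pages used when each group has its own page table.

    Heads are padded only to the longest head inside their own group.
    """
    return sum(
        len(g) * math.ceil(max(budgets[h] for h in g) / page_size)
        for g in groups
    )


def balance(budgets: List[int], num_parts: int) -> List[List[int]]:
    """Precompute a partition of heads with roughly equal total budget.

    Uses the greedy longest-processing-time heuristic as a stand-in:
    place the heaviest remaining head on the currently lightest partition.
    """
    parts: List[List[int]] = [[] for _ in range(num_parts)]
    loads = [0] * num_parts
    for h in sorted(range(len(budgets)), key=lambda h: -budgets[h]):
        i = loads.index(min(loads))
        parts[i].append(h)
        loads[i] += budgets[h]
    return parts


# Hypothetical retention ratios for 8 heads and a 1024-token history.
ratios = [0.9, 0.1, 0.5, 0.2, 0.8, 0.15, 0.45, 0.25]
budgets = reserve_budgets(ratios, seq_len=1024)
groups = group_heads(budgets, num_groups=4)
print("uniform pages:", pages_uniform(budgets))
print("ragged pages: ", pages_ragged(budgets, groups))
print("partitions:   ", balance(budgets, num_parts=2))
```

Because every quantity here depends only on the calibrated ratios and the sequence length, all three steps can run before any compression happens, which is the property the abstract attributes to resolving heterogeneity statically. In this toy setting, grouping similar-budget heads reduces padding relative to a single shared page length; the actual savings in a real system would depend on the true budget distribution and the clustering used.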