Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-Turn LLM Serving

Abstract

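In multi-turn LLM serving, dialogue history accumulates, and its key-value (KV) cache grows with every turn and every user until it quickly exceeds the model weights themselves, making memory, rather than compute, the bottleneck on throughput. Non-uniform KV compression assigns heterogeneous budgets to attention heads and preserves accuracy far better than uniform schemes, yet it remains hard to deploy in practice. Modern serving stacks assume that all attention heads share the same KV length, so heterogeneity leaves freed memory trapped by page fragmentation, costs up to 25% of prefill time to reclaim scattered pages, and skews GPU workloads, inflating decode latency by up to 1.7× or consuming 15-20% of every decode step on re-planning. We observe that this heterogeneity does not have to be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant ranking of heads, with per-head ratios confined to narrow bounds) that can be calibrated offline from as few as 50 samples. Building on this insight, we propose Tangram, a serving framework that resolves statically what prior systems handle dynamically. Budget Reservation fixes each attention head's post-compression memory footprint at scheduling time, eliminating page reclamation. Ragged Paging clusters heads with similar budgets into independent page tables, turning fragmentation into reclaimable memory. Ahead-of-Time Load Balancing precomputes balanced GPU partitions without any runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is open source: https://github.com/aiha-lab/TANGRAM.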

English

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads, which inflates decode latency by up to 1.7× or burns 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
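
To make the intuition behind grouping heads by budget and balancing them ahead of time more concrete, the following Python sketch may help. It is not taken from the Tangram code base. The page size, the per-head budgets, the equal-sized grouping rule, and the greedy longest-first balancing heuristic are all illustrative assumptions, and the function names (`pages_uniform`, `pages_grouped`, `balance_heads`) are hypothetical.

```python
import heapq
from math import ceil

PAGE_SIZE = 16  # tokens per KV page (assumed value, for illustration only)


def pages_uniform(budgets, page_size=PAGE_SIZE):
    """Pages needed if every head is padded to the longest head's KV length,
    as a serving stack with a single shared page table would require."""
    return len(budgets) * ceil(max(budgets) / page_size)


def pages_grouped(budgets, num_groups, page_size=PAGE_SIZE):
    """Pages needed if heads are sorted by budget and split into contiguous
    groups, each group padded only to its own largest budget.
    (Equal-sized groups are an assumption; other clusterings are possible.)"""
    ordered = sorted(budgets)
    group_size = ceil(len(ordered) / num_groups)
    total = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        total += len(group) * ceil(group[-1] / page_size)
    return total


def balance_heads(budgets, num_partitions):
    """Assign heads to partitions offline with a greedy heuristic:
    take heads from largest to smallest budget and give each one to the
    currently least-loaded partition. Returns (assignment, loads)."""
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_partitions)]
    for head in sorted(range(len(budgets)), key=lambda h: -budgets[h]):
        load, p = heapq.heappop(heap)
        assignment[p].append(head)
        heapq.heappush(heap, (load + budgets[head], p))
    loads = [sum(budgets[h] for h in part) for part in assignment]
    return assignment, loads


# Hypothetical per-head retained-token budgets for one layer.
budgets = [1024, 960, 512, 480, 256, 240, 128, 64]

print(pages_uniform(budgets))       # padded to the longest head
print(pages_grouped(budgets, 4))    # four independent page tables
print(balance_heads(budgets, 2)[1]) # per-partition token load
```

With these made-up budgets, padding every head to the longest one needs 512 pages, while four budget-sorted groups need 240, and the two precomputed partitions carry 1808 and 1856 tokens. Because the budgets are assumed to be known before serving, both computations can run once offline rather than at every decode step.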