Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-Turn LLM Serving

Abstract

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the size of the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads, which either inflates decode latency by up to 1.7× or burns 15% to 20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity, namely an input-invariant head ranking with narrowly bounded per-head ratios, that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
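
To make the three mechanisms named in the abstract concrete, the following is a minimal illustrative sketch, not Tangram's actual implementation. Everything in it is an assumption made for illustration: the function names, the page size of 16 tokens, the sort-and-chunk clustering rule, the greedy longest-first balancing heuristic, and the example per-head retention ratios. It shows only the general idea: once per-head budgets are known ahead of time, page counts and work partitions can be computed statically.

```python
from math import ceil

PAGE_SIZE = 16  # tokens per KV page; an assumed value, not taken from the paper


def reserve_budgets(ratios, prompt_len):
    """Fix each head's post-compression token budget before prefill runs.

    `ratios` stands in for offline-calibrated per-head retention ratios.
    """
    return [max(1, ceil(r * prompt_len)) for r in ratios]


def cluster_heads(budgets, num_groups):
    """Group heads with similar budgets (hypothetical rule: sort, then chunk)."""
    order = sorted(range(len(budgets)), key=lambda h: budgets[h])
    size = ceil(len(order) / num_groups)
    return [order[i:i + size] for i in range(0, len(order), size)]


def pages_needed(budgets, groups):
    """Count pages when each group has its own page table.

    Heads in one group share a page table, so each head is padded
    to the largest budget in its group.
    """
    total = 0
    for group in groups:
        group_len = max(budgets[h] for h in group)
        total += len(group) * ceil(group_len / PAGE_SIZE)
    return total


def balance_partitions(budgets, num_partitions):
    """Assign heads to partitions ahead of time with a greedy heuristic.

    Heads are taken largest-budget first and placed on the currently
    lightest partition; nothing here runs per decode step.
    """
    loads = [0] * num_partitions
    assignment = [[] for _ in range(num_partitions)]
    for h in sorted(range(len(budgets)), key=lambda h: -budgets[h]):
        p = loads.index(min(loads))
        assignment[p].append(h)
        loads[p] += budgets[h]
    return assignment, loads


# Made-up retention ratios for 8 heads and a 1024-token prompt.
ratios = [0.0625, 0.125, 0.125, 0.25, 0.5, 0.5, 0.75, 1.0]
budgets = reserve_budgets(ratios, prompt_len=1024)

single_table = pages_needed(budgets, cluster_heads(budgets, 1))  # 512 pages
four_tables = pages_needed(budgets, cluster_heads(budgets, 4))   # 240 pages
assignment, loads = balance_partitions(budgets, num_partitions=2)

print(budgets)
print(single_table, four_tables)
print(loads)  # [1728, 1664]
```

In this toy setting, a single shared page table pads every head to the longest budget and needs 512 pages, while four budget-clustered tables need 240. The gap between the two is the kind of padding waste that budget-aware grouping can make reclaimable. The real system's clustering and partitioning policies may differ substantially from these heuristics.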