Tangram: Enabling Non-Uniform KV Cache Compression for Efficient Multi-Turn LLM Serving

Abstract

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads, which either inflates decode latency by up to 1.7× or burns 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
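
To make the fragmentation argument concrete, the following toy page count compares a single page table sized for the longest head against page tables built per group of similar-budget heads. It is only an illustrative sketch, not Tangram's implementation: the function names, the per-head budgets, the page size of 16 tokens, and the simple sort-and-split grouping rule are all assumptions made for this example.

```python
import math


def pages_needed(tokens: int, page_size: int) -> int:
    """Number of fixed-size pages required to hold `tokens` KV entries."""
    return math.ceil(tokens / page_size)


def exact_pages(budgets: list[int], page_size: int) -> int:
    """Lower bound: every head gets exactly the pages its own budget needs."""
    return sum(pages_needed(b, page_size) for b in budgets)


def uniform_pages(budgets: list[int], page_size: int) -> int:
    """One shared page table: every head is allocated for the longest head."""
    return len(budgets) * pages_needed(max(budgets), page_size)


def clustered_pages(budgets: list[int], page_size: int, num_groups: int) -> int:
    """Sort heads by budget, split them into contiguous groups, and size
    each group's page table by the longest head inside that group."""
    ordered = sorted(budgets)
    group_size = math.ceil(len(ordered) / num_groups)
    total = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        total += len(group) * pages_needed(max(group), page_size)
    return total


# Hypothetical post-compression token budgets for eight attention heads.
budgets = [128, 160, 192, 256, 512, 640, 896, 1024]
PAGE_SIZE = 16  # tokens per page (assumed)

print("exact    :", exact_pages(budgets, PAGE_SIZE))
print("uniform  :", uniform_pages(budgets, PAGE_SIZE))
print("clustered:", clustered_pages(budgets, PAGE_SIZE, num_groups=4))
```

With these made-up numbers, the shared table needs 512 pages, four groups need 260, and the exact per-head requirement is 238, so grouping recovers most of the memory that a uniform layout would leave stranded. How Tangram actually forms its clusters and page tables is described only at the level of the abstract above.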