OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization
May 20, 2026
Authors: Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer
cs.AI
Abstract
The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm by jointly quantizing triplets of rotated coordinates. Each triplet's direction is mapped to a square via an octahedral parametrization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error yields a strictly non-uniform bit allocation that depends only on the total dimensionality of the keys. Sweeps show that the finite-dimensional quality optimum is constant across every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that widens as the bit budget shrinks toward extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency beyond the existing dequantization. Project Page: https://octopus-quant.github.io/
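To make the octahedral step concrete, the following is a minimal sketch of the standard octahedral mapping, which sends the direction of a 3-vector to a point in the square [-1, 1]^2 and keeps the norm as a third scalar. It is not the authors' implementation: the function names oct_encode and oct_decode are hypothetical, and the rotation, the Lloyd-Max codebooks, and the bit allocation described in the abstract are deliberately left out.

```python
import math


def _sign(x: float) -> float:
    # Sign that never returns 0, so points on an axis still fold correctly.
    return math.copysign(1.0, x)


def oct_encode(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Split a triplet into (norm, u, v) with (u, v) in the square [-1, 1]^2.

    Hypothetical helper for illustration only.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # direction is undefined; pick a fixed code
    l1 = abs(x) + abs(y) + abs(z)
    u, v = x / l1, y / l1  # project onto the octahedron |x| + |y| + |z| = 1
    if z < 0.0:
        # Fold the lower hemisphere outward into the corners of the square.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    return r, u, v


def oct_decode(r: float, u: float, v: float) -> tuple[float, float, float]:
    """Invert oct_encode: rebuild the triplet from (norm, u, v)."""
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        # Undo the fold for points that came from the lower hemisphere.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    n = math.sqrt(u * u + v * v + z * z)  # octahedron point -> unit sphere
    return r * u / n, r * v / n, r * z / n


if __name__ == "__main__":
    r, u, v = oct_encode(1.0, -2.0, -2.0)
    print(r, u, v)
    print(oct_decode(r, u, v))
```

In a codec of the kind the abstract describes, the three scalars (norm, u, v) would each be passed to a scalar quantizer before storage; here they are returned unquantized so that the round trip is exact up to floating-point error.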