Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Abstract

Serving transformer language models at high throughput requires caching key-value (KV) states to avoid redundant computation during autoregressive generation. The memory footprint of the KV cache is significant and heavily impacts serving costs. This work aims to reduce these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that keeping a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge: existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache offers an efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, each layer randomly chooses to attend either to its own KV states or to those of a preceding layer. This stochastic process makes the model robust to a range of depth-wise cache sharing strategies, preserving flexibility under hardware constraints that are unknown until deployment. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, the approach appears to have a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.
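The routing step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `share_prob` parameter, and the rule that a sharing layer reuses whatever source its immediate predecessor used are all assumptions made for illustration, since the abstract does not specify how the preceding layer is chosen or how the choice is sampled. The sketch shows only the routing decision and the resulting cache saving, not the attention computation itself.

```python
import random


def sample_kv_routing(num_layers, share_prob, rng):
    """Sample, for each layer, the index of the layer whose KV states it attends to.

    Assumption (not from the paper): a layer that shares reuses the same
    source as the layer directly before it.
    """
    sources = [0]  # the first layer has no predecessor, so it uses its own KV states
    for layer in range(1, num_layers):
        if rng.random() < share_prob:
            # Attend to the KV states already used by the previous layer.
            sources.append(sources[-1])
        else:
            # Attend to this layer's own KV states.
            sources.append(layer)
    return sources


def cached_layer_fraction(sources):
    """Fraction of layers whose KV states actually need to be stored."""
    return len(set(sources)) / len(sources)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled routing is reproducible
    routing = sample_kv_routing(num_layers=12, share_prob=0.5, rng=rng)
    print("KV source per layer:", routing)
    print("fraction of layers cached:", cached_layer_fraction(routing))
```

Under these assumptions, a fresh routing would be sampled at each training step, while a fixed routing would be chosen at deployment to match the available memory. For example, `share_prob=1.0` corresponds to every layer reading the first layer's cache, and `share_prob=0.0` recovers the standard per-layer cache.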