
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

May 7, 2026
Authors: Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee
cs.AI

Abstract

Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal beginning-of-sequence (BoS) anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED, using only 75% of layers for prefill tokens, reaches a 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving time to first token (TTFT) by 33%, time per output token (TPOT) by 22%, and reducing active KV memory by 25.0% at a 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
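
To make the policy concrete, here is a minimal sketch of the visibility rule the abstract describes, written from the abstract alone: prompt KV lives only in the lower 75% of layers, while the BoS anchor and Decode-phase tokens stay visible at every depth. All names, shapes, and the flat cache layout (visible_kv, NUM_LAYERS, CUTOFF) are illustrative assumptions, not the authors' implementation; the closing arithmetic merely reproduces the reported 25.0% figure from the stated 75% layer cutoff.

```python
# Minimal sketch of SPEED's layer-asymmetric KV visibility, reconstructed
# from the abstract; every name, shape, and the cache layout below are
# illustrative assumptions, not the authors' implementation.
import torch

NUM_LAYERS = 32                      # Llama-3.1-8B depth
CUTOFF = int(0.75 * NUM_LAYERS)      # prompt KV exists only in layers < 24

def visible_kv(layer: int,
               anchor_kv: torch.Tensor,    # BoS anchor, shape (1, d)
               prompt_kv: torch.Tensor,    # non-anchor prefill tokens, (P, d)
               decode_kv: torch.Tensor):   # decode-phase tokens, (T, d)
    """KV entries a Decode-step query may attend to at `layer`.

    Lower layers keep the usual full cache; upper layers expose only the
    BoS anchor plus Decode-phase tokens, so non-anchor prompt KV is never
    materialized (or attended to) above the cutoff.
    """
    if layer < CUTOFF:
        return torch.cat([anchor_kv, prompt_kv, decode_kv], dim=0)
    return torch.cat([anchor_kv, decode_kv], dim=0)

d = 128
anchor, prompt, decode = torch.zeros(1, d), torch.zeros(500, d), torch.zeros(8, d)
print(visible_kv(0, anchor, prompt, decode).shape)   # (509, 128) -- full view
print(visible_kv(30, anchor, prompt, decode).shape)  # (9, 128)   -- prompt dropped

# Back-of-envelope check of the reported 25.0% saving at a 128K prompt:
# prompt KV dominates the cache and is stored in only 24 of 32 layers.
P, T = 128_000, 256
full = NUM_LAYERS * (1 + P + T)
speed = CUTOFF * (1 + P + T) + (NUM_LAYERS - CUTOFF) * (1 + T)
print(f"active KV reduced by {100 * (1 - speed / full):.1f}%")   # ~25.0%
```

Presumably the TTFT gain follows from the same rule: if non-anchor prompt KV is never needed above the cutoff, the corresponding upper-layer Prefill work can be skipped, though the abstract does not spell out the mechanism.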