
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

May 7, 2026
Authors: Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee
cs.AI

Abstract

Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal beginning-of-sequence (BoS) anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED, using only 75% of layers for prefill tokens, reaches a 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving time to first token (TTFT) by 33%, time per output token (TPOT) by 22%, and reducing active KV memory by 25.0% at a 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
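
To make the policy concrete, here is a minimal sketch of the visibility rule the abstract describes, written from the abstract alone: prompt KV lives only in the lower 75% of layers, while the BoS anchor and Decode-phase tokens stay visible at every depth. All names, shapes, and the flat cache layout (visible_kv, NUM_LAYERS, CUTOFF) are illustrative assumptions, not the authors' implementation; the closing arithmetic merely reproduces the reported 25.0% figure from the stated 75% layer cutoff.

```python
# Minimal sketch of SPEED's layer-asymmetric KV visibility, reconstructed
# from the abstract; every name, shape, and the cache layout below are
# illustrative assumptions, not the authors' implementation.
import torch

NUM_LAYERS = 32                      # Llama-3.1-8B depth
CUTOFF = int(0.75 * NUM_LAYERS)      # prompt KV exists only in layers < 24

def visible_kv(layer: int,
               anchor_kv: torch.Tensor,    # BoS anchor, shape (1, d)
               prompt_kv: torch.Tensor,    # non-anchor prefill tokens, (P, d)
               decode_kv: torch.Tensor):   # decode-phase tokens, (T, d)
    """KV entries a Decode-step query may attend to at `layer`.

    Lower layers keep the usual full cache; upper layers expose only the
    BoS anchor plus Decode-phase tokens, so non-anchor prompt KV is never
    materialized (or attended to) above the cutoff.
    """
    if layer < CUTOFF:
        return torch.cat([anchor_kv, prompt_kv, decode_kv], dim=0)
    return torch.cat([anchor_kv, decode_kv], dim=0)

d = 128
anchor, prompt, decode = torch.zeros(1, d), torch.zeros(500, d), torch.zeros(8, d)
print(visible_kv(0, anchor, prompt, decode).shape)   # (509, 128) -- full view
print(visible_kv(30, anchor, prompt, decode).shape)  # (9, 128)   -- prompt dropped

# Back-of-envelope check of the reported 25.0% saving at a 128K prompt:
# prompt KV dominates the cache and is stored in only 24 of 32 layers.
P, T = 128_000, 256
full = NUM_LAYERS * (1 + P + T)
speed = CUTOFF * (1 + P + T) + (NUM_LAYERS - CUTOFF) * (1 + T)
print(f"active KV reduced by {100 * (1 - speed / full):.1f}%")   # ~25.0%
```

Presumably the TTFT gain follows from the same rule: if non-anchor prompt KV is never needed above the cutoff, the corresponding upper-layer Prefill work can be skipped, though the abstract does not spell out the mechanism.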