

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

May 7, 2026
Authors: Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee
cs.AI

Abstract

Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
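
The abstract describes the core mechanism: prompt-token KV states exist only in the lower layers, Decode-phase tokens keep full-depth KV, and a BoS anchor remains visible everywhere. The sketch below illustrates that layer-asymmetric visibility rule under stated assumptions; it is not the authors' implementation, and names such as prefill_depth_ratio, prefill_cutoff, and build_decode_kv are illustrative (the 75% ratio mirrors the configuration reported above).

```python
# Minimal sketch of SPEED-style layer-asymmetric KV visibility,
# assuming a Llama-like stack of decoder layers (hypothetical helper,
# not the paper's code).

import torch

num_layers = 32                  # e.g. Llama-3.1-8B
prefill_depth_ratio = 0.75       # prompt-token KV kept only in the lower 75% of layers
prefill_cutoff = int(num_layers * prefill_depth_ratio)  # layers [0, prefill_cutoff) keep prompt KV


def build_decode_kv(layer_idx, prompt_k, prompt_v, decode_k, decode_v):
    """Assemble the KV tensors visible to Decode-phase queries at one layer.

    prompt_k/v: [batch, heads, prompt_len, head_dim]  KV states of prefill tokens
    decode_k/v: [batch, heads, gen_len, head_dim]     KV states of generated tokens
    """
    if layer_idx < prefill_cutoff:
        # Lower layers: standard behaviour, full prompt KV plus Decode-phase KV.
        k = torch.cat([prompt_k, decode_k], dim=2)
        v = torch.cat([prompt_v, decode_v], dim=2)
    else:
        # Upper layers: keep only the BoS anchor (first prompt position);
        # all other prefill tokens are removed from the visibility set,
        # while Decode-phase tokens stay full-depth.
        k = torch.cat([prompt_k[:, :, :1], decode_k], dim=2)
        v = torch.cat([prompt_v[:, :, :1], decode_v], dim=2)
    return k, v
```

In this reading, the upper-layer prompt KV is never materialized at all, which is where the reported TTFT, TPOT, and KV-memory savings would come from.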