

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

May 7, 2026
Authors: Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee
cs.AI

Abstract

Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
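
The abstract describes the core mechanism: prompt-token KV states exist only in the lower layers, Decode-phase tokens keep full-depth KV, and a BoS anchor remains visible everywhere. The sketch below illustrates that layer-asymmetric visibility rule under stated assumptions; it is not the authors' implementation, and names such as prefill_depth_ratio, prefill_cutoff, and build_decode_kv are illustrative (the 75% ratio mirrors the configuration reported above).

```python
# Minimal sketch of SPEED-style layer-asymmetric KV visibility,
# assuming a Llama-like stack of decoder layers (hypothetical helper,
# not the paper's code).

import torch

num_layers = 32                  # e.g. Llama-3.1-8B
prefill_depth_ratio = 0.75       # prompt-token KV kept only in the lower 75% of layers
prefill_cutoff = int(num_layers * prefill_depth_ratio)  # layers [0, prefill_cutoff) keep prompt KV


def build_decode_kv(layer_idx, prompt_k, prompt_v, decode_k, decode_v):
    """Assemble the KV tensors visible to Decode-phase queries at one layer.

    prompt_k/v: [batch, heads, prompt_len, head_dim]  KV states of prefill tokens
    decode_k/v: [batch, heads, gen_len, head_dim]     KV states of generated tokens
    """
    if layer_idx < prefill_cutoff:
        # Lower layers: standard behaviour, full prompt KV plus Decode-phase KV.
        k = torch.cat([prompt_k, decode_k], dim=2)
        v = torch.cat([prompt_v, decode_v], dim=2)
    else:
        # Upper layers: keep only the BoS anchor (first prompt position);
        # all other prefill tokens are removed from the visibility set,
        # while Decode-phase tokens stay full-depth.
        k = torch.cat([prompt_k[:, :, :1], decode_k], dim=2)
        v = torch.cat([prompt_v[:, :, :1], decode_v], dim=2)
    return k, v
```

In this reading, the upper-layer prompt KV is never materialized at all, which is where the reported TTFT, TPOT, and KV-memory savings would come from.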