Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Abstract

Long-context inference in decoder-only language models is costly because long prompts are processed during prefill, cached at every layer, and repeatedly attended to during autoregressive decoding. We introduce Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility policy that materializes the KV states of non-anchor prompt tokens only in the lower layers while keeping decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer decode visibility set altogether. With a minimal BoS anchor, this simple change broadly preserves benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches an average score of 51.2 on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving time to first token (TTFT) by 33% and time per output token (TPOT) by 22% and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when decode-phase tokens remain full-depth.
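To make the visibility policy and the memory figure concrete, the following is a minimal sketch and not the authors' implementation. It assumes a 32-layer model (the depth of Llama-3.1-8B), a cutoff after 24 layers (75%), a single BoS anchor kept at every layer, and equal KV size per token per layer. The function names `kv_visible` and `active_kv_fraction` are hypothetical.

```python
def kv_visible(layer, position, prompt_len, cutoff_layer, num_anchors=1):
    """Return True if the KV state of the token at `position` is kept at `layer`.

    Assumed policy: anchor tokens (the first `num_anchors` positions) and
    decode-phase tokens (positions >= prompt_len) are kept at every layer;
    all other prompt tokens are kept only in layers below `cutoff_layer`.
    """
    is_anchor = position < num_anchors
    is_decode_token = position >= prompt_len
    return layer < cutoff_layer or is_anchor or is_decode_token


def active_kv_fraction(num_layers, cutoff_layer, prompt_len, decode_len, num_anchors=1):
    """Fraction of full-depth KV entries that remain under the assumed policy."""
    full = num_layers * (prompt_len + decode_len)
    # Non-anchor prompt tokens lose their KV entries in every layer at or above the cutoff.
    dropped = (num_layers - cutoff_layer) * (prompt_len - num_anchors)
    return (full - dropped) / full


if __name__ == "__main__":
    # 128K-token prompt, no generated tokens yet, 24 of 32 layers used for prefill tokens.
    fraction = active_kv_fraction(num_layers=32, cutoff_layer=24,
                                  prompt_len=128 * 1024, decode_len=0)
    print(f"active KV fraction: {fraction:.4f}")
```

Under these assumptions the remaining fraction is very close to 0.75 for a 128K prompt, which is consistent with the reported 25.0% reduction in active KV memory. The single anchor token contributes a negligible amount at this context length.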
