KV Prediction for Improved Time to First Token
October 10, 2024
Authors: Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi
cs.AI
Abstract
Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model's outputs. To reduce the time spent producing the first output (known as the "time to first token", or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at https://github.com/apple/corenet/tree/main/projects/kv-prediction .
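The flow described in the abstract can be illustrated with a short sketch: run the small auxiliary model over the prompt once, map its KV cache into the base model's KV space, and then let the base model generate autoregressively from that approximated cache. The sketch below is a hedged illustration of this idea, not the authors' released implementation: it assumes Hugging Face-style models that accept tuple-format `past_key_values`, and the names `generate_with_predicted_kv`, `kv_projections` (learned per-layer linear maps), and `layer_map` (which auxiliary layer feeds each base layer) are hypothetical.

```python
import torch

@torch.no_grad()
def generate_with_predicted_kv(base_model, aux_model, kv_projections, layer_map,
                               prompt_ids, max_new_tokens, eos_id):
    """Greedy decoding with a KV cache approximated by a small auxiliary model."""
    # 1) Prompt processing with the cheap auxiliary model (all but the last
    #    prompt token), producing its own per-layer KV cache.
    aux_out = aux_model(input_ids=prompt_ids[:, :-1], use_cache=True)
    aux_kv = aux_out.past_key_values  # one (key, value) pair per auxiliary layer

    # 2) KV Prediction: for each base-model layer, pick an auxiliary layer via
    #    `layer_map` and project its keys/values into the base model's KV space
    #    with learned linear maps (hypothetical `kv_projections`).
    predicted_kv = []
    for base_layer, aux_layer in enumerate(layer_map):
        k_aux, v_aux = aux_kv[aux_layer]
        k_proj, v_proj = kv_projections[base_layer]
        predicted_kv.append((k_proj(k_aux), v_proj(v_aux)))

    # 3) Autoregressive generation with the base model only. The last prompt
    #    token is fed to the base model on top of the predicted cache to get
    #    the first output token; the auxiliary model is never queried again.
    generated = prompt_ids
    next_input = prompt_ids[:, -1:]
    past = tuple(predicted_kv)
    for _ in range(max_new_tokens):
        out = base_model(input_ids=next_input, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_input = out.logits[:, -1:].argmax(dim=-1)  # greedy pick, shape (batch, 1)
        generated = torch.cat([generated, next_input], dim=1)
        if (next_input == eos_id).all():
            break
    return generated
```

With this split, the expensive pass over the full prompt is performed by the auxiliary model, so TTFT is governed by its cost rather than the base model's, while the base model pays only the usual per-token cost from the first generated token onward.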