
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

July 19, 2024
Authors: Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
cs.AI

Abstract

The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.
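The mechanism described in the abstract, ranking prompt tokens by how important they are for the next-token prediction and computing KV only for the highest-ranked subset, can be illustrated with a short sketch. The code below is a hypothetical illustration based on the abstract, not the authors' implementation: the function name `select_tokens`, the `keep_ratio` parameter, and the use of the last position's attention weights as the importance signal are all assumptions.

```python
# Minimal sketch (assumptions, not LazyLLM's actual code) of attention-based
# dynamic token selection: keep only the prompt positions that the current
# position attends to most strongly, and compute KV just for those.
import torch

def select_tokens(attn_to_last: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """attn_to_last: attention weights of the last query position over all
    prompt tokens, shape (seq_len,). Returns the (sorted) indices of the
    tokens to keep for KV computation at this step."""
    seq_len = attn_to_last.shape[-1]
    k = max(1, int(seq_len * keep_ratio))
    kept = torch.topk(attn_to_last, k).indices
    return torch.sort(kept).values  # preserve original token order

# Toy usage: random attention over a 12-token prompt, keeping half the tokens.
scores = torch.softmax(torch.randn(12), dim=-1)
print(select_tokens(scores, keep_ratio=0.5))
```

Because the ranking would be recomputed at each generation step, a token dropped earlier can re-enter the kept set later, which is the dynamic behavior the abstract contrasts with static prompt pruning.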
