LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
July 19, 2024
Authors: Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
cs.AI
Abstract
The inference of transformer-based large language models consists of two
sequential stages: 1) a prefilling stage to compute the KV cache of prompts and
generate the first token, and 2) a decoding stage to generate subsequent
tokens. For long prompts, the KV cache must be computed for all tokens during
the prefilling stage, which can significantly increase the time needed to
generate the first token. Consequently, the prefilling stage may become a
bottleneck in the generation process. An open question remains whether all
prompt tokens are essential for generating the first token. To answer this, we
introduce a novel method, LazyLLM, that selectively computes the KV for tokens
important for the next token prediction in both the prefilling and decoding
stages. Contrary to static pruning approaches that prune the prompt at once,
LazyLLM allows language models to dynamically select different subsets of
tokens from the context in different generation steps, even though they might
be pruned in previous steps. Extensive experiments on standard datasets across
various tasks demonstrate that LazyLLM is a generic method that can be
seamlessly integrated with existing language models to significantly accelerate
the generation without fine-tuning. For instance, in the multi-document
question-answering task, LazyLLM accelerates the prefilling stage of the LLama
2 7B model by 2.34x while maintaining accuracy.
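The following is a minimal, illustrative sketch (not the authors' implementation) of the dynamic pruning idea described in the abstract: at each step, only the prompt tokens whose attention scores from the current query are highest are carried forward, and tokens dropped at one step may be selected again at a later generation step. Names such as `keep_ratio`, `toy_attention`, and the random toy data are illustrative assumptions, not from the paper.

```python
import numpy as np

def toy_attention(query, keys):
    """Softmax attention scores of a single query over a set of keys."""
    logits = keys @ query / np.sqrt(query.shape[-1])
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def select_tokens(query, keys, keep_ratio):
    """Return indices of the tokens with the highest attention scores."""
    scores = toy_attention(query, keys)
    k = max(1, int(np.ceil(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
d, n_prompt = 16, 12
prompt_keys = rng.normal(size=(n_prompt, d))  # stand-in for prompt token states

# "Prefilling": the query of the last prompt token decides which prompt
# tokens are worth computing KV for when generating the first token.
last_token_query = rng.normal(size=d)
kept_step0 = select_tokens(last_token_query, prompt_keys, keep_ratio=0.5)
print("tokens used for the first token:", kept_step0)

# A later "decoding" step uses a different query, so a different subset of
# prompt tokens may be selected, including tokens pruned at the first step.
next_query = rng.normal(size=d)
kept_step1 = select_tokens(next_query, prompt_keys, keep_ratio=0.5)
print("tokens used at a later step:    ", kept_step1)
print("revived tokens:", sorted(set(kept_step1) - set(kept_step0)))
```

The key contrast with static prompt pruning is visible in the last line: because the kept subset is recomputed per generation step, a token discarded earlier can re-enter the computation once it becomes relevant to the current prediction.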