LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
July 19, 2024
Authors: Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
cs.AI
Abstract
The inference of transformer-based large language models consists of two
sequential stages: 1) a prefilling stage to compute the KV cache of prompts and
generate the first token, and 2) a decoding stage to generate subsequent
tokens. For long prompts, the KV cache must be computed for all tokens during
the prefilling stage, which can significantly increase the time needed to
generate the first token. Consequently, the prefilling stage may become a
bottleneck in the generation process. An open question remains whether all
prompt tokens are essential for generating the first token. To answer this, we
introduce a novel method, LazyLLM, that selectively computes the KV for tokens
important for the next token prediction in both the prefilling and decoding
stages. Contrary to static pruning approaches that prune the prompt at once,
LazyLLM allows language models to dynamically select different subsets of
tokens from the context in different generation steps, even though they might
be pruned in previous steps. Extensive experiments on standard datasets across
various tasks demonstrate that LazyLLM is a generic method that can be
seamlessly integrated with existing language models to significantly accelerate
the generation without fine-tuning. For instance, in the multi-document
question-answering task, LazyLLM accelerates the prefilling stage of the LLama
2 7B model by 2.34x while maintaining accuracy.
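The following is a minimal, illustrative sketch (not the authors' implementation) of the dynamic pruning idea described in the abstract: at each step, only the prompt tokens whose attention scores from the current query are highest are carried forward, and tokens dropped at one step may be selected again at a later generation step. Names such as `keep_ratio`, `toy_attention`, and the random toy data are illustrative assumptions, not from the paper.

```python
import numpy as np

def toy_attention(query, keys):
    """Softmax attention scores of a single query over a set of keys."""
    logits = keys @ query / np.sqrt(query.shape[-1])
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def select_tokens(query, keys, keep_ratio):
    """Return indices of the tokens with the highest attention scores."""
    scores = toy_attention(query, keys)
    k = max(1, int(np.ceil(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
d, n_prompt = 16, 12
prompt_keys = rng.normal(size=(n_prompt, d))  # stand-in for prompt token states

# "Prefilling": the query of the last prompt token decides which prompt
# tokens are worth computing KV for when generating the first token.
last_token_query = rng.normal(size=d)
kept_step0 = select_tokens(last_token_query, prompt_keys, keep_ratio=0.5)
print("tokens used for the first token:", kept_step0)

# A later "decoding" step uses a different query, so a different subset of
# prompt tokens may be selected, including tokens pruned at the first step.
next_query = rng.normal(size=d)
kept_step1 = select_tokens(next_query, prompt_keys, keep_ratio=0.5)
print("tokens used at a later step:    ", kept_step1)
print("revived tokens:", sorted(set(kept_step1) - set(kept_step0)))
```

The key contrast with static prompt pruning is visible in the last line: because the kept subset is recomputed per generation step, a token discarded earlier can re-enter the computation once it becomes relevant to the current prediction.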