

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

March 19, 2024
作者: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang
cs.AI

Abstract

This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and in the meantime introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.
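The abstract's core mechanism, scoring every token with a bidirectional encoder and keeping only the highest-scoring ones, can be illustrated with a minimal sketch. This is not the authors' released implementation: the checkpoint id, the assumption that label index 1 means "keep this token", and the simple top-k selection heuristic are all assumptions made for illustration.

```python
# Illustrative sketch of token-classification prompt compression in the spirit
# of LLMLingua-2. The checkpoint id and the "keep" label index are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "microsoft/llmlingua-2-xlm-roberta-large-meetingbank"  # assumed checkpoint id


def compress_prompt(prompt: str, keep_ratio: float = 0.5) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
    model.eval()

    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, seq_len, num_labels)

    # Assumption: label index 1 is the probability of keeping the token.
    keep_prob = logits.softmax(dim=-1)[0, :, 1]

    # Exclude special tokens (CLS/SEP/etc.) from the ranking.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            enc["input_ids"][0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    )
    keep_prob = keep_prob.masked_fill(special, -1.0)

    # Keep the highest-scoring fraction of tokens, preserving original order.
    n_keep = max(1, int(keep_ratio * (~special).sum().item()))
    top_idx = keep_prob.topk(n_keep).indices.sort().values
    kept_ids = enc["input_ids"][0, top_idx]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)


if __name__ == "__main__":
    print(compress_prompt("Please summarize the following meeting transcript ...", keep_ratio=0.4))
```

The authors also distribute a packaged implementation (the `llmlingua` Python package, which exposes a `PromptCompressor` class); the sketch above is only meant to show why an encoder-based token classifier can compress prompts in a single forward pass rather than token-by-token.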

