LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
March 19, 2024
Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang
cs.AI
Abstract
This paper focuses on task-agnostic prompt compression for better
generalizability and efficiency. Considering the redundancy in natural
language, existing approaches compress prompts by removing tokens or lexical
units according to their information entropy obtained from a causal language
model such as LLaMa-7B. The challenge is that information entropy may be a
suboptimal compression metric: (i) it only leverages unidirectional context and
may fail to capture all essential information needed for prompt compression;
(ii) it is not aligned with the prompt compression objective.
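
To illustrate the entropy-based baseline the paper critiques, the sketch below scores each token's surprisal under a causal LM (left-to-right context only) and keeps the most informative fraction. This is a minimal illustration under stated assumptions, not the baselines' actual implementation: the gpt2 checkpoint stands in for LLaMa-7B, and the top-k keep heuristic is a simplification.

```python
# Illustrative sketch of unidirectional, entropy-style prompt pruning.
# Assumptions: "gpt2" stands in for LLaMa-7B; keeping the top-k most surprising
# tokens is a simplified heuristic, not the actual baseline implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def entropy_prune(prompt: str, keep_ratio: float = 0.5, model_name: str = "gpt2") -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"][0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]          # (seq_len, vocab)
    # Surprisal of token t given only tokens < t (unidirectional context).
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    surprisal = -log_probs[torch.arange(ids.numel() - 1), ids[1:]]
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.topk(surprisal, k).indices + 1             # positions of retained tokens
    keep = torch.cat([torch.tensor([0]), keep]).sort().values
    return tokenizer.decode(ids[keep], skip_special_tokens=True)
```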
To address these issues, we propose a data distillation procedure to derive
knowledge from an LLM to compress prompts without losing crucial information,
and meantime, introduce an extractive text compression dataset. We formulate
prompt compression as a token classification problem to guarantee the
faithfulness of the compressed prompt to the original one, and use a
Transformer encoder as the base architecture to capture all essential
information for prompt compression from the full bidirectional context. Our
approach leads to lower latency by explicitly learning the compression
objective with smaller models such as XLM-RoBERTa-large and mBERT.
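
To make the token-classification formulation concrete, the following is a minimal inference sketch, assuming an encoder such as XLM-RoBERTa-large fine-tuned with two labels (drop/keep) per token; the checkpoint path and the "label 1 = keep" convention are placeholders, not the released model or code. Ranking tokens by the classifier's keep probability and retaining the top fraction in their original order is what makes the compression extractive and faithful to the source prompt.

```python
# Minimal sketch of prompt compression as token classification.
# Placeholder checkpoint; assumes label index 1 means "keep" -- both are
# illustrative assumptions, not the released artifacts.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

def classify_and_compress(prompt: str, keep_ratio: float = 0.5,
                          checkpoint: str = "path/to/xlm-roberta-large-compressor") -> str:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint)
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]                     # (seq_len, 2), full bidirectional context
    keep_prob = torch.softmax(logits, dim=-1)[:, 1]         # P(keep) for every token
    k = max(1, int(keep_ratio * keep_prob.numel()))
    keep = torch.topk(keep_prob, k).indices.sort().values   # highest-scoring tokens, original order
    return tokenizer.decode(enc["input_ids"][0][keep], skip_special_tokens=True)
```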
We evaluate our method on both in-domain and out-of-domain datasets,
including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its
small size, our model shows significant performance gains over strong baselines
and demonstrates robust generalization ability across different LLMs.
Additionally, our model is 3x-6x faster than existing prompt compression
methods, while accelerating the end-to-end latency by 1.6x-2.9x with
compression ratios of 2x-5x.