CompLLM: Compression for Long Context Q&A

September 23, 2025
Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah
cs.AI

Abstract

Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
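
As a rough illustration of the design the abstract describes, the sketch below compresses a context segment by segment, caching each segment's latents so that queries with overlapping contexts reuse work. This is a minimal toy, not the paper's implementation: `SegmentCompressor`, `compress_context`, the linear folding layer, and the byte-level cache key are all assumptions made for illustration; the actual learned compressor and cache design may differ.

```python
import torch
import torch.nn as nn


class SegmentCompressor(nn.Module):
    """Hypothetical stand-in for a learned soft compressor: folds every
    `rate` consecutive token embeddings into one latent embedding,
    shortening the sequence by `rate`x (2x by default, as in the paper)."""

    def __init__(self, d_model: int, rate: int = 2):
        super().__init__()
        self.rate = rate
        self.proj = nn.Linear(d_model * rate, d_model)

    def forward(self, seg: torch.Tensor) -> torch.Tensor:
        # seg: (seg_len, d_model); seg_len assumed divisible by self.rate.
        seg_len, d = seg.shape
        return self.proj(seg.reshape(seg_len // self.rate, d * self.rate))


def compress_context(tokens: torch.Tensor, seg_len: int,
                     compressor: SegmentCompressor,
                     cache: dict) -> torch.Tensor:
    """Compress a long context one fixed-size segment at a time.
    Cost grows linearly with context length, and segments already in
    `cache` are served from it, so overlapping contexts share work."""
    latents = []
    for start in range(0, tokens.shape[0], seg_len):
        seg = tokens[start:start + seg_len]
        key = seg.detach().cpu().numpy().tobytes()  # content-addressed key
        if key not in cache:
            with torch.no_grad():
                cache[key] = compressor(seg)
        latents.append(cache[key])
    # The concatenated latents stand in for the raw context tokens.
    return torch.cat(latents, dim=0)


if __name__ == "__main__":
    d_model, seg_len = 64, 8
    compressor = SegmentCompressor(d_model, rate=2)
    context = torch.randn(4 * seg_len, d_model)  # 32 "token" embeddings
    cache: dict = {}
    out = compress_context(context, seg_len, compressor, cache)
    print(out.shape)  # torch.Size([16, 64]): half the original length
```

Because each segment is compressed in isolation, the loop's cost scales linearly with context length, and a segment's latents can be computed once and reused by any later query whose context contains that segment, mirroring the efficiency and reusability properties claimed above.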