End-to-End Context Compression at Scale

Abstract

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
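The memory argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer count, KV-head count, head dimension, and 16-bit precision are hypothetical placeholders, not the configuration of the models described here. It also assumes that a ratio of 1:r means r input tokens map to one latent embedding, and that the decoder keeps one KV-cache entry per latent. Encoder cost and activation memory are ignored.

```python
# Rough KV-cache arithmetic under hypothetical decoder dimensions.
# None of the default values below come from the paper; they are placeholders.

def kv_cache_bytes(num_positions, num_layers=36, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_positions` positions."""
    # Factor 2: one key vector and one value vector per layer, KV head, position.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_positions


def compressed_length(num_tokens, ratio):
    """Latent embeddings produced when `ratio` tokens map to one latent (rounded up)."""
    return -(-num_tokens // ratio)  # ceiling division


context_tokens = 128_000  # hypothetical prompt length
for ratio in (1, 4, 8, 16):  # ratio 1 stands for the uncompressed baseline
    positions = compressed_length(context_tokens, ratio)
    gib = kv_cache_bytes(positions) / 2**30
    print(f"1:{ratio:<2} -> {positions:>7} positions, ~{gib:.2f} GiB of decoder KV cache")
```

Because the cache size is linear in the number of cached positions, the decoder-side cache shrinks by roughly the compression factor under these assumptions. Actual peak memory also depends on the encoder and on the inference engine.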