End-to-End Context Compression at Scale

Abstract

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
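
To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' accounting. It assumes a standard attention KV cache whose size is proportional to the number of cached positions, and it assumes each latent embedding occupies exactly one decoder position. The decoder shape in `CONFIG` (36 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical placeholder, not a configuration reported in the abstract, and the encoder's own cost is ignored.

```python
import math

# Hypothetical decoder shape, chosen only for illustration (an assumption,
# not a configuration stated in the abstract).
CONFIG = dict(num_layers=36, num_kv_heads=8, head_dim=128)


def kv_cache_bytes(num_positions, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Size of a standard KV cache: one key and one value vector
    per layer, per KV head, per cached position."""
    per_position = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_position * num_positions


def latent_length(num_tokens, ratio):
    """Number of latent embeddings for a 1:ratio compressor
    (assumes the last partial block still yields one latent)."""
    return math.ceil(num_tokens / ratio)


def report(num_tokens, ratios=(4, 8, 16)):
    """Return (label, decoder positions, KV cache bytes) rows for the
    uncompressed context and for each compression ratio."""
    rows = [("none", num_tokens, kv_cache_bytes(num_tokens, **CONFIG))]
    for r in ratios:
        n = latent_length(num_tokens, r)
        rows.append((f"1:{r}", n, kv_cache_bytes(n, **CONFIG)))
    return rows


if __name__ == "__main__":
    for label, positions, size in report(131072):
        print(f"{label:>5}  positions={positions:>7}  "
              f"kv_cache={size / 2**30:.3f} GiB")
```

Under these assumptions, the decoder-side cache shrinks in direct proportion to the compression ratio, which is why the ratios 1:4, 1:8, and 1:16 map directly onto peak-memory savings. Actual savings depend on the real model shape, the cache data type, and the encoder's memory footprint.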