End-to-End Context Compression at Scale

June 8, 2026
Authors: Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov
cs.AI

Abstract

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
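
To make the memory argument concrete, here is a minimal back-of-the-envelope sketch of how a KV cache scales with sequence length, and how many latent embeddings the 1:4, 1:8, and 1:16 ratios would leave for the decoder. The layer count, KV head count, head dimension, 16-bit precision, and 128,000-token context are hypothetical values chosen for illustration; they are not taken from the paper. The sketch also assumes the decoder caches keys and values only over the latent positions, and it ignores the encoder's own cost.

```python
import math


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: a key and a value vector per layer, head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def compressed_length(num_tokens, ratio):
    """Number of latent embeddings when every `ratio` tokens map to one latent (1:ratio)."""
    return math.ceil(num_tokens / ratio)


# Hypothetical decoder shape, for illustration only (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
context_tokens = 128_000

baseline = kv_cache_bytes(context_tokens, LAYERS, KV_HEADS, HEAD_DIM)
for ratio in (4, 8, 16):
    latents = compressed_length(context_tokens, ratio)
    size = kv_cache_bytes(latents, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"1:{ratio}: {latents} latents, "
          f"{size / 2**30:.2f} GiB vs {baseline / 2**30:.2f} GiB uncompressed")
```

Because the cache size is linear in sequence length, a 1:r ratio shrinks the decoder-side cache by roughly a factor of r under these assumptions; the quality cost of doing so is what the paper's evaluation measures.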