End-to-End Context Compression at Scale

Abstract

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
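To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch of how the decoder's KV cache shrinks when a prompt is replaced by latent embeddings at the 1:4, 1:8, and 1:16 ratios named above. The layer count, number of KV heads, head dimension, and 16-bit precision are hypothetical illustrative values, not the configuration of the models described here. The sketch also assumes each latent embedding occupies exactly one decoder position, and it ignores the encoder's own memory.

```python
from math import ceil


def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    All default dimensions are hypothetical, chosen only for illustration.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def latent_length(n_tokens, ratio):
    """Number of latent embeddings produced by a 1:ratio compressor (rounded up)."""
    return ceil(n_tokens / ratio)


def summarize(n_tokens, ratios=(4, 8, 16)):
    """Return (label, decoder positions, KV cache bytes) for each setting."""
    rows = [("uncompressed", n_tokens, kv_cache_bytes(n_tokens))]
    for r in ratios:
        m = latent_length(n_tokens, r)
        rows.append((f"1:{r}", m, kv_cache_bytes(m)))
    return rows


if __name__ == "__main__":
    # A 128K-token prompt, used here purely as an example input size.
    for label, length, nbytes in summarize(131072):
        print(f"{label:>12}  positions={length:>7}  kv_cache={nbytes / 2**30:.2f} GiB")
```

Because the cache size is linear in the number of decoder positions, a 1:r compressor reduces this term by roughly a factor of r under these assumptions. Measured peak memory also depends on the encoder, activations, and the inference engine, which this sketch does not model.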