End-to-End Context Compression at Scale

Abstract

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
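
To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It uses the standard KV cache size estimate (keys and values stored per layer, per KV head, per position) and assumes that at a compression ratio of 1:r the decoder caches roughly one position per r input tokens. The decoder configuration (`n_layers`, `n_kv_heads`, `head_dim`), the 16-bit element size, and the 128,000-token prompt are hypothetical values chosen only for illustration; encoder memory and activations are ignored.

```python
import math


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def latent_length(n_tokens, ratio):
    """Number of latent embeddings at compression ratio 1:ratio, rounded up."""
    return math.ceil(n_tokens / ratio)


# Hypothetical decoder configuration (an assumption, not taken from the paper).
CONFIG = dict(n_layers=36, n_kv_heads=8, head_dim=128)


def report(n_tokens=128_000, ratios=(1, 4, 8, 16)):
    """Return (ratio, cached positions, KV cache size in GiB) for each ratio."""
    rows = []
    for r in ratios:
        length = latent_length(n_tokens, r)
        gib = kv_cache_bytes(length, **CONFIG) / 2**30
        rows.append((r, length, gib))
    return rows


if __name__ == "__main__":
    for r, length, gib in report():
        print(f"1:{r:<2}  positions={length:>7}  kv_cache={gib:6.2f} GiB")
```

Under these assumptions the decoder-side cache shrinks linearly with the compression ratio, which is the sense in which a shorter latent sequence relieves the memory bottleneck; the actual peak memory of a compressor also depends on the encoder and on how compression is scheduled.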