
Cartridges: Lightweight and general-purpose long context representations via self-study

June 6, 2025
Authors: Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Ré
cs.AI

Abstract

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
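To make the core idea concrete, below is a minimal sketch (not the authors' code) of what a Cartridge and a context-distillation update might look like: a small set of trainable key/value tensors is optimized offline so that the model conditioned on them mimics a teacher that sees the full corpus in its context window. The interface `model(input_ids, prefix_kv=...)`, the class name `CartridgeKV`, and parameters such as `num_cartridge_tokens` and `teacher_logits` are assumptions for illustration only.

```python
# Hypothetical sketch of the Cartridge idea: a trainable KV-cache prefix
# optimized with a context-distillation loss on synthetic conversations.
import torch
import torch.nn.functional as F


class CartridgeKV(torch.nn.Module):
    """Trainable key/value tensors that stand in for the KV cache of a long corpus."""

    def __init__(self, num_layers, num_heads, head_dim, num_cartridge_tokens):
        super().__init__()
        # One (key, value) pair per layer; reused by every query over the same corpus.
        shape = (num_layers, 2, 1, num_heads, num_cartridge_tokens, head_dim)
        self.kv = torch.nn.Parameter(0.02 * torch.randn(shape))


def context_distillation_step(model, cartridge, synthetic_ids, teacher_logits, optimizer):
    """One self-study-style update on a synthetic conversation about the corpus.

    Assumes `model` accepts the cartridge as a KV prefix and that `teacher_logits`
    were precomputed by the same model with the full corpus in its context window.
    """
    student_logits = model(synthetic_ids, prefix_kv=cartridge.kv).logits
    # Match the student (cartridge-conditioned) distribution to the teacher (ICL) one.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the trained `cartridge.kv` would simply be loaded in place of the corpus KV cache before decoding, which is what amortizes its training cost across all queries that reference the same corpus.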