

Cartridges: Lightweight and general-purpose long context representations via self-study

June 6, 2025
作者: Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re
cs.AI

Abstract

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
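To make the recipe in the abstract concrete, below is a minimal sketch of the two-stage self-study loop: generate synthetic conversations about the corpus, then train a small set of key/value tensors (the Cartridge) with a context-distillation objective, matching the frozen model's predictions when the full corpus is in context. The `model(input_ids, past_key_values=...)` interface, the helper names (`make_cartridge`, `synthesize_conversation`, `context_distillation_step`), the prompt text, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of self-study for training a Cartridge (trainable KV cache).
# Assumes a HuggingFace-style causal LM whose forward pass accepts a
# per-layer list of (key, value) tensors as `past_key_values`; this
# interface detail is an assumption, not taken from the paper.
import torch
import torch.nn.functional as F

NUM_CARTRIDGE_TOKENS = 2048  # size of the trained KV cache (illustrative)


def make_cartridge(num_layers, num_heads, head_dim, n_tokens, device):
    """Trainable key/value tensors, one (K, V) pair per layer."""
    shape = (1, num_heads, n_tokens, head_dim)
    return [
        (0.02 * torch.randn(shape, device=device, requires_grad=True),
         0.02 * torch.randn(shape, device=device, requires_grad=True))
        for _ in range(num_layers)
    ]


@torch.no_grad()
def synthesize_conversation(model, tokenizer, corpus_ids):
    """Self-study data: have the model quiz itself about the corpus."""
    prompt_ids = tokenizer(
        "Ask and answer a question about the document above.",
        return_tensors="pt",
    ).input_ids.to(corpus_ids.device)
    query = torch.cat([corpus_ids, prompt_ids], dim=-1)
    # Returns a synthetic (question, answer) turn grounded in the corpus.
    return model.generate(query, max_new_tokens=256)


def context_distillation_step(model, cartridge, corpus_ids, convo_ids, optimizer):
    """One training step: align the student's next-token distribution
    (Cartridge-only prefix) with the teacher's (full corpus in the prefix)."""
    # Teacher: frozen model conditioned on the entire corpus plus the
    # synthetic conversation; keep only the logits over the conversation.
    with torch.no_grad():
        teacher_logits = model(torch.cat([corpus_ids, convo_ids], dim=-1)).logits
        teacher_logits = teacher_logits[:, -convo_ids.shape[-1]:]

    # Student: same frozen model, but the corpus is replaced by the much
    # smaller trainable Cartridge supplied as the KV-cache prefix.
    student_logits = model(convo_ids, past_key_values=cartridge).logits

    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the Cartridge tensors receive gradients; the language model stays frozen, which is why the training cost can be amortized across every future query that references the same corpus.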