LLoCO: Learning Long Contexts Offline

Abstract

Processing long contexts remains a challenge for large language models (LLMs) because of the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA. Our approach extends the effective context window of a 4k-token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using 30× fewer tokens during inference. LLoCO achieves up to a 7.62× speed-up and substantially reduces the cost of long-document question answering, making it a promising solution for efficient long-context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
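As a rough sanity check on the figures quoted above, the following sketch compares the claimed window extension with the claimed token reduction. It is only back-of-the-envelope arithmetic, not part of LLoCO itself: the helper names `compressed_length` and `window_extension` are hypothetical, "4k" and "128k" are assumed to mean round thousands, and the abstract's "30×" is assumed to be an approximate ratio.

```python
import math


def compressed_length(context_tokens: int, ratio: float) -> int:
    """Number of tokens left after shrinking a context by `ratio` (rounded up)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.ceil(context_tokens / ratio)


def window_extension(extended_tokens: int, native_tokens: int) -> float:
    """How many times larger the extended window is than the native one."""
    return extended_tokens / native_tokens


# Round figures read off the abstract; "k" is treated as 1,000 (an assumption).
NATIVE_WINDOW = 4_000      # native window of the base model
EXTENDED_WINDOW = 128_000  # longest context the approach is said to handle
TOKEN_REDUCTION = 30       # "30× fewer tokens" at inference time

print(window_extension(EXTENDED_WINDOW, NATIVE_WINDOW))     # 32.0
print(compressed_length(EXTENDED_WINDOW, TOKEN_REDUCTION))  # 4267
```

Under these assumptions, a 128k-token document shrinks to roughly 4.3k tokens, about the size of the base model's native window. That is consistent with a roughly 30× token reduction being what allows a 4k-token model to cover a 128k-token context.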

