FocusLLM: Scaling LLM's Context by Parallel Decoding

Abstract

Enabling LLMs to use information from a long context is crucial for many downstream applications. However, achieving long context lengths with the conventional transformer architecture requires substantial training and inference resources. In this paper, we present FocusLLM, a framework designed to extend the context length of any decoder-only LLM, enabling the model to focus on relevant information from very long sequences. FocusLLM divides a long text input into chunks based on the model's original context length, which alleviates the issue of attention distraction. It then appends the local context to each chunk as a prompt, extracts essential information from each chunk through a novel parallel decoding mechanism, and finally integrates the extracted information into the local context. FocusLLM is notable for its training efficiency and versatility: trained with an 8K input length at much lower training cost than previous methods, it exhibits superior performance across downstream long-context tasks and maintains strong language modeling ability on very long texts, up to 400K tokens. Our code is available at https://github.com/leezythu/FocusLLM.
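
The chunking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to treat the last local_len tokens as the local context, and the chunk size of context_len - local_len are all assumptions made for illustration. The sketch only prepares the per-chunk inputs; the parallel decoding and the integration of the extracted information require a model and are not shown.

```python
from typing import List, Sequence, Tuple


def split_into_chunks(tokens: Sequence[int], chunk_len: int) -> List[List[int]]:
    """Split token ids into consecutive chunks of at most chunk_len tokens."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [list(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]


def build_parallel_inputs(
    tokens: Sequence[int], context_len: int, local_len: int
) -> Tuple[List[List[int]], List[int]]:
    """Build one input per chunk by appending the local context to it.

    Assumptions (hypothetical, not taken from the paper):
    - the local context is the last local_len tokens of the sequence;
    - each chunk holds context_len - local_len tokens, so that chunk plus
      local context fits in the model's original context length.
    """
    if not 0 < local_len < context_len:
        raise ValueError("require 0 < local_len < context_len")
    local = list(tokens[-local_len:])
    earlier = tokens[:-local_len]  # everything before the local context
    chunks = split_into_chunks(earlier, context_len - local_len)
    # Each chunk is independent of the others, so these inputs could be
    # decoded in parallel as a single batch.
    return [chunk + local for chunk in chunks], local
```

For example, with ten token ids, context_len=5, and local_len=2, the first eight tokens are split into chunks of three, and the last two tokens are appended to every chunk.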
