Why Does the Effective Context Length of LLMs Fall Short?

Abstract

Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs' pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within the models' existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama 3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
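
The skew in relative-position frequency that the abstract refers to can be illustrated with a simple count. The sketch below is not taken from the paper: it assumes a single sequence under causal attention, and the function name `relative_position_counts` is hypothetical. It shows only that a distance d between query and key occurs `seq_len - d` times, so the largest distances are seen least often. It ignores the actual length mix of any training corpus.

```python
from collections import Counter


def relative_position_counts(seq_len: int) -> Counter:
    """Count how often each relative distance (query index - key index)
    occurs in one sequence of seq_len tokens under causal attention."""
    counts = Counter()
    for q in range(seq_len):
        for k in range(q + 1):  # causal mask: keys at or before the query
            counts[q - k] += 1
    return counts


# Toy example with 8 tokens: distance 0 is seen 8 times, distance 7 only once.
counts = relative_position_counts(8)
for distance in range(8):
    print(distance, counts[distance])
```

Under this toy assumption, the count falls linearly with distance, which is one way to see why positions near the end of the training window receive comparatively few training signals.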
