LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

Abstract


In recent years, there have been remarkable advancements in the performance of Transformer-based Large Language Models (LLMs) across various domains. As these LLMs are deployed for increasingly complex tasks, they often need to carry out longer reasoning processes or understand larger contexts. In these situations, the failure of LLMs to generalize to long sequences becomes more prominent. Most pre-training schemes truncate training sequences to a fixed length (such as 2048 for LLaMA). Given contexts longer than this, LLMs often struggle to generate fluent text, let alone carry out downstream tasks, even when they use relative positional encoding, which is designed to cope with this problem. Common solutions such as finetuning on longer corpora often involve daunting hardware and time costs and require careful training-process design. To leverage the generation capacity of existing LLMs more efficiently, we theoretically and empirically investigate the main out-of-distribution (OOD) factors contributing to this problem. Inspired by this diagnosis, we propose a simple yet effective solution for on-the-fly length generalization, LM-Infinite, which involves only a Lambda-shaped attention mask and a distance limit and requires no parameter updates or learning. We find it applicable to a variety of LLMs that use relative-position encoding methods. LM-Infinite is computationally efficient, with O(n) time and space, and demonstrates consistent fluency and generation quality on sequences of up to 32k tokens on the ArXiv and OpenWebText2 datasets, with a 2.72x decoding speedup. On downstream tasks such as passkey retrieval, it continues to work on inputs much longer than the training length, where vanilla models fail immediately.
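The two components named above, a Lambda-shaped attention mask and a distance limit, can be illustrated with a minimal pure-Python sketch. It assumes details the abstract does not spell out: that the Lambda-shaped mask lets each query see a small set of leading tokens plus a sliding window of its most recent tokens, and that the distance limit caps the relative distance passed to the positional encoding. The function names `visible_keys`, `effective_distance`, and `total_attended_pairs` and the parameters `n_global`, `n_local`, and `max_distance` are hypothetical, chosen only for illustration. Because each query attends to at most `n_global + n_local` keys, the total attention work grows linearly with sequence length, which is consistent with the O(n) claim.

```python
def visible_keys(i: int, n_global: int, n_local: int) -> list[int]:
    """Key positions that query i may attend to under a Lambda-shaped causal mask.

    Hypothetical parameters (not defined in the abstract):
      n_global: number of leading tokens every query can always see
      n_local:  size of the sliding window of most recent tokens (including i)
    """
    # Left arm of the Lambda: the first n_global tokens, clipped to stay causal.
    global_keys = range(min(n_global, i + 1))
    # Right arm: the n_local most recent tokens, skipping any already in global_keys.
    local_start = max(n_global, i - n_local + 1)
    local_keys = range(local_start, i + 1)
    return list(global_keys) + list(local_keys)


def effective_distance(i: int, j: int, max_distance: int) -> int:
    """Relative distance from query i to key j, capped at max_distance.

    Assumption: max_distance would be set to the pre-training length so the
    positional encoding never sees a distance larger than it saw in training.
    """
    return min(i - j, max_distance)


def total_attended_pairs(seq_len: int, n_global: int, n_local: int) -> int:
    """Count (query, key) pairs scored; bounded by seq_len * (n_global + n_local)."""
    return sum(len(visible_keys(i, n_global, n_local)) for i in range(seq_len))


if __name__ == "__main__":
    print(visible_keys(10, n_global=2, n_local=3))      # [0, 1, 8, 9, 10]
    print(effective_distance(10, 0, max_distance=4))    # 4 (clipped from 10)
    print(total_attended_pairs(1000, n_global=2, n_local=3))
```

For example, `visible_keys(10, 2, 3)` returns `[0, 1, 8, 9, 10]`: the two leading tokens plus the three most recent ones. Listing visible key indices per query, rather than materializing a dense n-by-n mask, is what keeps this sketch linear in sequence length.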
