Liger: Linearizing Large Language Models to Gated Recurrent Structures

Abstract

Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. Linearizing large language models (LLMs) converts pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning, and they overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, which allows various gated recurrent structures to be formed without training additional components from scratch. With lightweight fine-tuning using Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the performance of the Transformer-based LLM while using only 0.02% of the pre-training tokens during the linearization process. It achieves competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
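
The two mechanisms named in the abstract, constant-memory inference and gates derived from the existing key projection, can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `gated_linear_recurrence` is hypothetical, and the scalar gate computed as the sigmoid of the mean-pooled key is an assumption chosen here for illustration; the paper describes several gating variants, and normalization, multi-head structure, and the hybrid Liger Attention path are all omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_recurrence(queries, keys, values):
    """Process a sequence token by token with a fixed-size state.

    queries, keys: lists of length-d_k vectors.
    values: list of length-d_v vectors.
    Returns one length-d_v output vector per token.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # State S has shape d_k x d_v and does not grow with sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Assumption for illustration: a scalar gate pooled from the key,
        # so the gate needs no parameters beyond the key projection.
        gate = sigmoid(sum(k) / d_k)
        # S_t = gate * S_{t-1} + k_t^T v_t
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gate * state[i][j] + k[i] * v[j]
        # o_t = q_t S_t
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs
```

Because each step touches only the d_k x d_v state, memory use at inference is independent of how many tokens have been processed, in contrast to the growing key-value cache of softmax attention.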

