Lizard: An Efficient Linearization Framework for Large Language Models

Abstract

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. As context length grows, Transformer-based LLMs face significant memory and computational bottlenecks due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that speeds up the training of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
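To make the constant-memory claim concrete, the following is a minimal sketch of a generic gated linear attention recurrence, not Lizard's actual formulation. It assumes a scalar gate per token and omits the feature map, normalization, sliding window branch, and meta memory that the abstract mentions. The function name `gated_linear_attention` and its parameters are hypothetical and chosen for illustration only.

```python
from typing import List

Vector = List[float]


def gated_linear_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    gates: List[float],
) -> List[Vector]:
    """Generic gated linear attention in recurrent form (illustrative only).

    Assumed update rule, with a scalar gate g_t per token:
        S_t = g_t * S_{t-1} + outer(k_t, v_t)
        o_t = q_t @ S_t
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    # The state is a d_k x d_v matrix whose size does not depend on
    # sequence length, unlike a KV cache that grows with every token.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: List[Vector] = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory by the gate, then write the new pair.
                state[i][j] = g * state[i][j] + k[i] * v[j]
        # Read from memory with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs
```

With every gate set to 1.0 this reduces to plain (ungated) linear attention, which never forgets. Gates below 1.0 let the model decay older context, which is one way to read the "adaptive memory control" described above.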
