HGRN2: Gated Linear RNNs with State Expansion

Abstract

Hierarchically gated linear RNN (HGRN, Qin et al. 2023) has demonstrated competitive training speed and performance in language modeling while offering efficient inference. However, the recurrent state size of HGRN remains relatively small, which limits its expressiveness. To address this issue, inspired by linear attention, we introduce a simple outer-product-based state expansion mechanism so that the recurrent state size can be significantly enlarged without introducing any additional parameters. The linear attention form also allows for hardware-efficient training. Our extensive experiments verify the advantage of HGRN2 over HGRN1 in language modeling, image classification, and Long Range Arena. Our largest 3B HGRN2 model slightly outperforms Mamba and a LLaMa-architecture Transformer for language modeling in a controlled experimental setting, and performs competitively with many open-source 3B models in downstream evaluation while using far fewer total training tokens.
