Parcae: Scaling Laws For Stable Looped Language Models

Abstract

Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through more parameters, at the cost of a higher memory footprint, or through more data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability in existing looped architectures results from large spectral norms in their injection parameters. To address this, we propose Parcae, a novel, stable looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models. Using this stable looped architecture, we investigate the scaling properties of looping as a medium for improving quality by increasing FLOPs at training and test time. For training, we derive predictable power laws for scaling FLOPs while keeping the parameter count fixed. Our initial scaling laws suggest that, given a fixed FLOP budget, looping and data should be increased in tandem. At test time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points, respectively, over strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
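The stability argument in the abstract can be illustrated with a toy linear recurrence. The sketch below is not the Parcae architecture: it assumes a purely diagonal, linear loop `h <- a * h + u` and a hypothetical parameterization in which a negative continuous-time rate `-softplus(raw)` is discretized as `exp(-step * softplus(raw))`. The names `discretized_decay`, `run_loop`, `raw`, `step`, and `inject` are illustrative and do not come from the paper. For a diagonal matrix the spectral norm is the largest absolute entry, so this mapping keeps it strictly below 1, while an unconstrained diagonal with an entry above 1 lets the state grow geometrically with the number of loops.

```python
import math


def softplus(x):
    # Numerically stable softplus, log(1 + exp(x)); the result is always > 0.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discretized_decay(raw, step):
    """Map unconstrained values to diagonal entries strictly inside (0, 1).

    Assumed (hypothetical) parameterization: the continuous-time rate is
    -softplus(raw) < 0, and discretizing with a positive step gives
    exp(step * rate), which always lies in (0, 1).
    """
    return [math.exp(-step * softplus(r)) for r in raw]


def run_loop(diag, inject, num_loops):
    """Iterate h <- diag * h + inject elementwise; return max |h| per loop."""
    h = [0.0] * len(diag)
    norms = []
    for _ in range(num_loops):
        h = [a * x + u for a, x, u in zip(diag, h, inject)]
        norms.append(max(abs(x) for x in h))
    return norms


raw = [-2.0, 0.5, 3.0]        # arbitrary unconstrained parameters
inject = [1.0, 1.0, 1.0]      # the same input injected at every loop
stable = discretized_decay(raw, step=0.5)
unstable = [1.2, 0.9, 1.05]   # unconstrained diagonal, spectral norm 1.2 > 1

stable_norms = run_loop(stable, inject, 64)
unstable_norms = run_loop(unstable, inject, 64)
```

In the constrained case the state stays below the geometric-series bound `1 / (1 - max(stable))` no matter how many loops are run, which is the behavior one would want when increasing the loop count at test time. The real model is nonlinear and not necessarily diagonal, so this only mirrors the linear approximation mentioned in the abstract.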
