SNLP: Layer-Parallel Inference via Structured Newton Corrections

Abstract

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that conventional tensor and pipeline parallelism do not remove. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving that equation with parallel Newton-style updates. Although this view is principled, exact Newton corrections require expensive Jacobian-vector products, and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap, architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), in which the correction reduces to a prefix-sum-like update; in mHC-style architectures, it yields HC Newton (HCN), which uses the model's residual mixing matrix. We further introduce SNLP-aware regularization, which trains models so that one or a few structured Newton iterations accurately approximate the sequential forward pass. Experiments on nanochat-scale Transformers show that SNLP regularization improves layer-parallel compatibility and can also improve standard sequential perplexity (PPL), reducing baseline PPL by 4.7% to 23.4%. At inference time, SNLP combined with layer fusion and chunkwise decomposition achieves practical wall-clock speedups: on a 0.5B nanochat model, it reaches a 2.3x speedup while still improving PPL by 6.1%. These results suggest that layer-parallel inference is not merely a numerical approximation to sequential execution, but can act as a useful solver-induced inference bias. We also characterize two limitations: off-the-shelf pretrained models are less amenable to this procedure, and exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling.
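To make the prefix-sum structure of IDN concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It assumes a residual stack h_{l+1} = h_l + f_l(h_l) with h_0 = x, so the trace solves r_{l+1} = h_{l+1} - h_l - f_l(h_l) = 0. If the layer Jacobian f_l' is replaced by zero, the Newton system becomes delta_{l+1} - delta_l = -r_{l+1}, whose solution is a prefix sum; the updated trace is then h_{l+1} = x + sum_{k<=l} f_k(h_k), with every f_k evaluated on the previous iterate and hence independently across layers. The scalar hidden state, the toy branches in make_layers, and all function names are hypothetical stand-ins chosen for illustration.

```python
import math
from itertools import accumulate


def make_layers(num_layers):
    # Hypothetical toy residual branches (stand-ins for Transformer blocks).
    # The default argument l=l binds the layer index at definition time.
    return [lambda h, l=l: 0.1 * math.tanh(h + 0.3 * l) for l in range(num_layers)]


def sequential_forward(x, layers):
    # Reference: ordinary layer-by-layer execution, returns [h_0, ..., h_L].
    trace = [x]
    for f in layers:
        trace.append(trace[-1] + f(trace[-1]))
    return trace


def idn_step(trace, layers):
    # One identity-Jacobian Newton update (assumed form, see lead-in).
    x = trace[0]
    # Each branch reads only the previous iterate, so these evaluations
    # have no dependency on one another and could run in parallel.
    branch = [f(h) for f, h in zip(layers, trace[:-1])]
    # The correction is a prefix sum over the branch outputs.
    return [x] + [x + s for s in accumulate(branch)]


def idn_solve(x, layers, num_iters):
    # Initialize every layer's hidden state to the input, then iterate.
    trace = [x] * (len(layers) + 1)
    for _ in range(num_iters):
        trace = idn_step(trace, layers)
    return trace


if __name__ == "__main__":
    layers = make_layers(6)
    reference = sequential_forward(0.5, layers)
    for n in (1, 2, 6):
        approx = idn_solve(0.5, layers, n)
        err = max(abs(a - b) for a, b in zip(approx, reference))
        print(f"iterations={n} max_abs_error={err:.3e}")
```

In this toy setting, iteration n makes the first n layers exact, so running as many iterations as there are layers reproduces the sequential trace up to rounding. This matches the limitation stated above that exact convergence recovers the sequential computation; any benefit has to come from stopping after one or a few iterations.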