SNLP: Layer-Parallel Inference via Structured Newton Corrections

Abstract

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that conventional tensor or pipeline parallelism does not remove. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products, and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the model's residual mixing matrix. We further introduce SNLP-aware regularization, which trains models so that one or a few structured Newton iterations accurately approximate the sequential forward pass. Experiments on nanochat-scale Transformers show that SNLP regularization improves layer-parallel compatibility and can also improve standard sequential perplexity, reducing baseline PPL by 4.7% to 23.4%. At inference time, SNLP combined with layer fusion and chunkwise decomposition achieves practical wall-clock speedups: on a 0.5B nanochat model, it reaches a 2.3x speedup while still improving PPL by 6.1%. These results suggest that layer-parallel inference is not merely a numerical approximation to sequential execution, but can act as a useful solver-induced inference bias. We also characterize limitations: off-the-shelf pretrained models are less amenable to this procedure, and exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling.
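To make the "prefix-sum-like update" concrete, the following is a minimal NumPy sketch of one possible reading of Identity Newton, not the authors' implementation. It assumes a plain residual stack h_{l+1} = h_l + f_l(h_l) and approximates each layer Jacobian I + ∂f_l by the identity; under that assumption the Newton correction collapses to h_l ← h_0 + Σ_{j<l} f_j(h_j), with every f_j evaluated at the current guess. The toy tanh branches and the names `make_layers`, `identity_newton_step`, and `solve_trace` are hypothetical and stand in for real Transformer blocks.

```python
import numpy as np


def make_layers(num_layers, dim, scale=0.1, seed=0):
    """Toy residual branches f_l(h) = tanh(h @ W_l) (hypothetical stand-ins)."""
    rng = np.random.default_rng(seed)
    weights = [scale * rng.standard_normal((dim, dim)) for _ in range(num_layers)]
    # Bind W as a default argument so each closure keeps its own matrix.
    return [lambda h, W=W: np.tanh(h @ W) for W in weights]


def sequential_forward(h0, layers):
    """Reference: h_{l+1} = h_l + f_l(h_l), evaluated one layer at a time."""
    trace = [h0]
    for f in layers:
        trace.append(trace[-1] + f(trace[-1]))
    return np.stack(trace)  # shape (L + 1, dim)


def identity_newton_step(trace, layers):
    """One structured update with every layer Jacobian replaced by the identity.

    The branch evaluations depend only on the current guess, so they are
    independent across layers; the correction is then a cumulative sum.
    """
    branch = np.stack([f(h) for f, h in zip(layers, trace[:-1])])  # (L, dim)
    new_trace = np.empty_like(trace)
    new_trace[0] = trace[0]
    new_trace[1:] = trace[0] + np.cumsum(branch, axis=0)
    return new_trace


def solve_trace(h0, layers, num_iters):
    """Start from a constant trace (every layer equals h0) and iterate."""
    trace = np.tile(h0, (len(layers) + 1, 1))
    for _ in range(num_iters):
        trace = identity_newton_step(trace, layers)
    return trace


if __name__ == "__main__":
    layers = make_layers(num_layers=6, dim=8)
    h0 = np.ones(8)
    reference = sequential_forward(h0, layers)
    for k in (1, 2, 6):
        approx = solve_trace(h0, layers, num_iters=k)
        err = np.max(np.abs(approx[-1] - reference[-1]))
        print(f"iterations={k}  final-layer max abs error={err:.3e}")
```

In this sketch, k iterations reproduce the first k layers of the sequential trace up to floating-point rounding, so L iterations recover the full sequential result. That is consistent with the limitation noted in the abstract: exact convergence recovers the sequential computation, and any speedup has to come from stopping after one or a few iterations on a model trained to tolerate that truncation.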