SNLP: Layer-Parallel Inference via Structured Newton Corrections

Abstract

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trajectory across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products, and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the model's residual mixing matrix. We further introduce SNLP-aware regularization, which trains models so that one or a few structured Newton iterations accurately approximate the sequential forward pass. Experiments on nanochat-scale Transformers show that SNLP regularization improves layer-parallel compatibility and can also improve standard sequential perplexity, reducing baseline PPL by 4.7% to 23.4%. At inference time, SNLP combined with layer fusion and chunkwise decomposition achieves practical wall-clock speedups: on a 0.5B Nanochat model, it reaches a 2.3x speedup while still improving PPL by 6.1%. These results suggest that layer-parallel inference is not merely a numerical approximation to sequential execution, but can act as a useful solver-induced inference bias. We also characterize limitations: off-the-shelf pretrained models are less amenable to this procedure, and exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling.
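
As a rough illustration of the identity-Newton idea, the sketch below assumes a toy residual stack x_{l+1} = x_l + f_l(x_l) with hypothetical tanh branches. The function names, shapes, and weights are all made up for illustration and do not come from the paper. Under that assumption, dropping the branch Jacobian from the Newton system leaves a correction that amounts to a cumulative sum over branch outputs evaluated on the current guesses. This is one plausible reading of the "prefix-sum-like update", not necessarily the authors' implementation.

```python
import numpy as np

def make_layers(num_layers, dim, scale=0.1, seed=0):
    # Hypothetical toy weights; a real Transformer block is far richer.
    rng = np.random.default_rng(seed)
    return [scale * rng.standard_normal((dim, dim)) for _ in range(num_layers)]

def branch(w, x):
    # Stand-in for the residual branch f_l (attention + MLP in a real model).
    return np.tanh(x @ w)

def sequential_forward(weights, x0):
    # Reference: the usual layer-by-layer pass, returning all hidden states.
    states = [x0]
    for w in weights:
        states.append(states[-1] + branch(w, states[-1]))
    return np.stack(states)  # shape (L + 1, dim)

def identity_newton_forward(weights, x0, num_iters):
    # Assumed simplification: approximate each layer Jacobian by the identity,
    # so the Newton correction becomes a prefix sum of branch outputs.
    num_layers = len(weights)
    states = np.tile(x0, (num_layers + 1, 1))  # initial guess: every layer sees x0
    for _ in range(num_iters):
        # Each branch depends only on the current guess for its own input,
        # so these evaluations could run in parallel across layers.
        updates = np.stack([branch(w, states[l]) for l, w in enumerate(weights)])
        # x_l <- x_0 + sum_{k < l} f_k(x_k_old), i.e. a cumulative sum.
        states = np.concatenate([x0[None, :], x0 + np.cumsum(updates, axis=0)])
    return states

weights = make_layers(num_layers=6, dim=8)
x0 = np.ones(8)
reference = sequential_forward(weights, x0)
for iters in (1, 2, 6):
    approx = identity_newton_forward(weights, x0, iters)
    print(iters, float(np.abs(approx - reference).max()))
```

In this toy setting, the first k + 1 hidden states are exact after k iterations, so running as many iterations as there are layers reproduces the sequential states up to floating-point rounding. This matches the limitation noted in the abstract that exact convergence recovers the sequential computation.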