FFN Fusion: Rethinking Sequential Computation in Large Language Models
March 24, 2025
Authors: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
cs.AI
Abstract
We introduce FFN Fusion, an architectural optimization technique that reduces
sequential computation in large language models by identifying and exploiting
natural opportunities for parallelization. Our key insight is that sequences of
Feed-Forward Network (FFN) layers, particularly those remaining after the
removal of specific attention layers, can often be parallelized with minimal
accuracy impact. We develop a principled methodology for identifying and fusing
such sequences, transforming them into parallel operations that significantly
reduce inference latency while preserving model behavior. Applying these
techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base
(Ultra-253B-Base), an efficient and soon-to-be publicly available model that
achieves a 1.71X speedup in inference latency and 35X lower per-token cost
while maintaining strong performance across benchmarks. Through extensive
experiments on models from 49B to 253B parameters, we demonstrate that FFN
Fusion becomes increasingly effective at larger scales and can complement
existing optimization techniques like quantization and pruning. Most
intriguingly, we find that even full transformer blocks containing both
attention and FFN layers can sometimes be parallelized, suggesting new
directions for neural architecture design.
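To make the core idea concrete, the sketch below illustrates one way consecutive FFN layers can be fused into a single wider FFN. This is a minimal PyTorch illustration, not the authors' code: it assumes a simplified, non-gated FFN (Llama-style gated FFNs would concatenate the gate projection analogously) and the parallel view in which the fused layers all read the same input and their outputs are summed; the names `FFN` and `fuse_ffns` are hypothetical.

```python
import torch
import torch.nn as nn


class FFN(nn.Module):
    """Simplified feed-forward block: down_proj(act(up_proj(x)))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


def fuse_ffns(ffns: list[FFN]) -> FFN:
    """Fuse several FFNs into one wider FFN.

    Assumption (the parallel-sum view): every layer in `ffns` reads the same
    input and their outputs are summed. Under that assumption, concatenating
    the up-projection weights along the hidden dimension and the
    down-projection weights along their input dimension gives an exactly
    equivalent single FFN; applying this to originally *sequential* layers is
    the approximation whose accuracy impact the paper studies.
    """
    d_model = ffns[0].up.in_features
    d_ff_total = sum(f.up.out_features for f in ffns)
    fused = FFN(d_model, d_ff_total)
    with torch.no_grad():
        # up weights: (d_ff_i, d_model) stacked along dim 0 -> (sum d_ff, d_model)
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        # down weights: (d_model, d_ff_i) stacked along dim 1 -> (d_model, sum d_ff)
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused


if __name__ == "__main__":
    # Check that the fused FFN matches the sum of the parallel branches.
    ffns = [FFN(16, 64) for _ in range(3)]
    x = torch.randn(2, 16)
    fused = fuse_ffns(ffns)
    assert torch.allclose(fused(x), sum(f(x) for f in ffns), atol=1e-4)
```

Because the fused layer is a single matrix-multiply pair of the same total width, it replaces several sequential kernel launches with one, which is where the latency reduction comes from in this view.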