
FFN Fusion: Rethinking Sequential Computation in Large Language Models

March 24, 2025
Authors: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
cs.AI

Abstract

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
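
To make the core idea concrete, below is a minimal, hypothetical sketch of how a sequence of Llama-style FFN layers could be fused into one wider FFN that is applied once in parallel. It is not the authors' implementation: the SwiGLUFFN class, the fuse_ffns helper, and all dimensions are illustrative assumptions, and it models the fusion simply as replacing a chain of FFNs with the sum of their outputs on a shared input, ignoring residual connections, normalization, and the paper's procedure for selecting which sequences are safe to fuse.

```python
# Minimal sketch (not the authors' code): fuse several Llama-style SwiGLU FFNs
# into a single wider FFN, under the assumption that the parallelized block
# should compute sum_k FFN_k(x) with every sub-FFN reading the same input x.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """One Llama-style feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


def fuse_ffns(ffns: list[SwiGLUFFN]) -> SwiGLUFFN:
    """Concatenate gate/up/down weights of several FFNs into one wider FFN.

    Each sub-FFN now sees the same input, so the fused block computes
    sum_k FFN_k(x) with one pass of large matrix multiplies instead of a
    chain of smaller, strictly sequential ones.
    """
    d_model = ffns[0].gate.in_features
    d_hidden_total = sum(f.gate.out_features for f in ffns)
    fused = SwiGLUFFN(d_model, d_hidden_total)
    with torch.no_grad():
        fused.gate.weight.copy_(torch.cat([f.gate.weight for f in ffns], dim=0))
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused


if __name__ == "__main__":
    torch.manual_seed(0)
    ffns = [SwiGLUFFN(64, 128) for _ in range(3)]
    x = torch.randn(2, 64)

    parallel_sum = sum(f(x) for f in ffns)   # target: sum of FFNs on the same input
    fused_out = fuse_ffns(ffns)(x)           # one wide FFN, applied once
    print(torch.allclose(parallel_sum, fused_out, atol=1e-5))  # True up to float error
```

The concatenation works because the SwiGLU activation is elementwise: stacking the gate and up projections keeps each sub-FFN's hidden slice independent, and the concatenated down projection sums the slices back into the model dimension, reproducing the parallel sum exactly.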
