FFN Fusion: Rethinking Sequential Computation in Large Language Models
March 24, 2025
作者: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
cs.AI
Abstract
We introduce FFN Fusion, an architectural optimization technique that reduces
sequential computation in large language models by identifying and exploiting
natural opportunities for parallelization. Our key insight is that sequences of
Feed-Forward Network (FFN) layers, particularly those remaining after the
removal of specific attention layers, can often be parallelized with minimal
accuracy impact. We develop a principled methodology for identifying and fusing
such sequences, transforming them into parallel operations that significantly
reduce inference latency while preserving model behavior. Applying these
techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base
(Ultra-253B-Base), an efficient and soon-to-be publicly available model that
achieves a 1.71X speedup in inference latency and 35X lower per-token cost
while maintaining strong performance across benchmarks. Through extensive
experiments on models from 49B to 253B parameters, we demonstrate that FFN
Fusion becomes increasingly effective at larger scales and can complement
existing optimization techniques like quantization and pruning. Most
intriguingly, we find that even full transformer blocks containing both
attention and FFN layers can sometimes be parallelized, suggesting new
directions for neural architecture design.
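To make the parallelization idea concrete, the sketch below shows one way a run of consecutive FFN layers could be fused into a single wider FFN applied to a shared input, replacing the sequential chain with one parallel operation. This is a minimal illustration based only on the abstract: the SwiGLU-style FFN, the module and function names, and the "concatenate weights so the fused layer computes the sum of the individual outputs" rule are assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of FFN fusion: n FFN layers are collapsed into one wider FFN
# that evaluates all of them on the same input in a single pass.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """A Llama-style gated FFN (assumed structure for this sketch)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


def fuse_ffns(ffns: list[SwiGLUFFN]) -> SwiGLUFFN:
    """Concatenate gate/up weights row-wise and down weights column-wise,
    so the fused FFN computes sum_i FFN_i(x) in one matmul-sized pass."""
    d_model = ffns[0].gate.in_features
    d_ff_total = sum(f.gate.out_features for f in ffns)
    fused = SwiGLUFFN(d_model, d_ff_total)
    with torch.no_grad():
        fused.gate.weight.copy_(torch.cat([f.gate.weight for f in ffns], dim=0))
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused


if __name__ == "__main__":
    # Sanity check: the fused layer matches the sum of the individual FFN
    # outputs on a shared input (the parallel approximation of the sequence).
    torch.manual_seed(0)
    ffns = [SwiGLUFFN(64, 256) for _ in range(3)]
    x = torch.randn(2, 5, 64)
    fused = fuse_ffns(ffns)
    assert torch.allclose(fused(x), sum(f(x) for f in ffns), atol=1e-5)
```

Note that running the original layers sequentially through their residual connections is not mathematically identical to this parallel form; the abstract's claim is that, for suitable FFN sequences, the difference has minimal accuracy impact while the fused layer removes the sequential dependency at inference time.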