FFN Fusion: Rethinking Sequential Computation in Large Language Models
March 24, 2025
Authors: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
cs.AI
Abstract
We introduce FFN Fusion, an architectural optimization technique that reduces
sequential computation in large language models by identifying and exploiting
natural opportunities for parallelization. Our key insight is that sequences of
Feed-Forward Network (FFN) layers, particularly those remaining after the
removal of specific attention layers, can often be parallelized with minimal
accuracy impact. We develop a principled methodology for identifying and fusing
such sequences, transforming them into parallel operations that significantly
reduce inference latency while preserving model behavior. Applying these
techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base
(Ultra-253B-Base), an efficient and soon-to-be publicly available model that
achieves a 1.71X speedup in inference latency and 35X lower per-token cost
while maintaining strong performance across benchmarks. Through extensive
experiments on models from 49B to 253B parameters, we demonstrate that FFN
Fusion becomes increasingly effective at larger scales and can complement
existing optimization techniques like quantization and pruning. Most
intriguingly, we find that even full transformer blocks containing both
attention and FFN layers can sometimes be parallelized, suggesting new
directions for neural architecture design.
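To make the core idea concrete, the sketch below illustrates one way consecutive FFN layers can be fused into a single wider FFN. This is a minimal PyTorch illustration, not the authors' code: it assumes a simplified, non-gated FFN (Llama-style gated FFNs would concatenate the gate projection analogously) and the parallel view in which the fused layers all read the same input and their outputs are summed; the names `FFN` and `fuse_ffns` are hypothetical.

```python
import torch
import torch.nn as nn


class FFN(nn.Module):
    """Simplified feed-forward block: down_proj(act(up_proj(x)))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


def fuse_ffns(ffns: list[FFN]) -> FFN:
    """Fuse several FFNs into one wider FFN.

    Assumption (the parallel-sum view): every layer in `ffns` reads the same
    input and their outputs are summed. Under that assumption, concatenating
    the up-projection weights along the hidden dimension and the
    down-projection weights along their input dimension gives an exactly
    equivalent single FFN; applying this to originally *sequential* layers is
    the approximation whose accuracy impact the paper studies.
    """
    d_model = ffns[0].up.in_features
    d_ff_total = sum(f.up.out_features for f in ffns)
    fused = FFN(d_model, d_ff_total)
    with torch.no_grad():
        # up weights: (d_ff_i, d_model) stacked along dim 0 -> (sum d_ff, d_model)
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        # down weights: (d_model, d_ff_i) stacked along dim 1 -> (d_model, sum d_ff)
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused


if __name__ == "__main__":
    # Check that the fused FFN matches the sum of the parallel branches.
    ffns = [FFN(16, 64) for _ in range(3)]
    x = torch.randn(2, 16)
    fused = fuse_ffns(ffns)
    assert torch.allclose(fused(x), sum(f(x) for f in ffns), atol=1e-4)
```

Because the fused layer is a single matrix-multiply pair of the same total width, it replaces several sequential kernel launches with one, which is where the latency reduction comes from in this view.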