FFN Fusion: Rethinking Sequential Computation in Large Language Models

Abstract

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those that remain after specific attention layers are removed, can often be parallelized with minimal impact on accuracy. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71x speedup in inference latency and a 35x lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques such as quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
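The abstract does not spell out how the fusion is carried out, so the following is only a minimal sketch under stated assumptions: each FFN is a plain two-matrix block with a SiLU activation, and "parallelizing" a run of FFN layers means feeding all of them the same input and summing their outputs onto the residual stream. Under those assumptions, the parallel form is mathematically identical to a single wider FFN whose input matrices are stacked and whose output matrices are concatenated. All names here (`ffn`, `fuse`, `random_matrix`) and all sizes are hypothetical and chosen for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def silu(v):
    """Elementwise SiLU activation (an assumed choice for this sketch)."""
    return [z / (1.0 + math.exp(-z)) for z in v]


def ffn(W_in, W_out, x):
    """Simplified FFN: W_out @ silu(W_in @ x), without the residual."""
    return matvec(W_out, silu(matvec(W_in, x)))


def fuse(ffns):
    """Fuse FFNs that share one input into a single wider FFN.

    Input matrices are stacked row-wise; output matrices are
    concatenated column-wise, so the hidden width is the sum of widths.
    """
    W_in = [row for w_in, _ in ffns for row in w_in]
    d_model = len(ffns[0][1])
    W_out = [[v for _, w_out in ffns for v in w_out[r]] for r in range(d_model)]
    return W_in, W_out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden, n_layers = 4, 6, 3  # hypothetical toy sizes
ffns = [
    (random_matrix(d_hidden, d_model, rng), random_matrix(d_model, d_hidden, rng))
    for _ in range(n_layers)
]
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

# Parallel form: every FFN reads the same x; outputs are summed onto the residual.
parallel = list(x)
for w_in, w_out in ffns:
    parallel = [p + o for p, o in zip(parallel, ffn(w_in, w_out, x))]

# Fused form: one wide FFN applied once, plus the residual.
W_in, W_out = fuse(ffns)
fused = [xi + o for xi, o in zip(x, ffn(W_in, W_out, x))]

max_diff = max(abs(a - b) for a, b in zip(parallel, fused))
```

In this sketch the fused and parallel forms agree up to floating-point rounding. The approximation lies elsewhere: in replacing the original sequential layers, where each FFN sees the previous one's output, with the parallel form. That is the step the abstract reports as having minimal accuracy impact.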
