Approximating Two-Layer Feedforward Networks for Efficient Transformers
October 16, 2023
Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
cs.AI
Abstract
How can we reduce the compute and memory requirements of neural networks (NNs)
without sacrificing performance? Many recent works use sparse Mixtures of
Experts (MoEs) to build resource-efficient large language models (LMs). Here we
introduce several novel perspectives on MoEs, presenting a general framework
that unifies various methods to approximate two-layer NNs (e.g., feedforward
blocks of Transformers), including product-key memories (PKMs). Leveraging
insights from this framework, we propose methods to improve both MoEs and PKMs.
Unlike prior work that compares MoEs with dense baselines under the
compute-equal condition, our evaluation condition is parameter-equal, which is
crucial to properly evaluate LMs. We show that our MoEs are competitive with
the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two
different scales, while being much more resource-efficient. This demonstrates
that MoEs are relevant not only to extremely large LMs but also to any-scale
resource-efficient LMs. Our code is public.
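
The sketch below is a minimal, hypothetical illustration (not the authors' released code) of the general idea the abstract describes: a sparse top-k Mixture-of-Experts layer standing in for a Transformer's two-layer feedforward block, with the experts splitting the dense hidden size so the total parameter count stays comparable to the dense layer (the parameter-equal setting). The class name `MoEFeedForward`, the sigmoid gating, and the expert sizing are illustrative assumptions.

```python
# Hypothetical sketch: a top-k routed MoE layer approximating a dense
# two-layer feedforward block of a Transformer (parameter-equal setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        assert d_ff % n_experts == 0, "split the dense hidden size across experts"
        d_expert = d_ff // n_experts  # experts share the dense layer's parameter budget
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Expert weights as batched tensors: (n_experts, d_model, d_expert) and back.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to one row per token.
        tokens = x.reshape(-1, x.shape[-1])
        scores = torch.sigmoid(self.router(tokens))        # per-expert gate scores
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                        # chosen expert for each token
            gate = topk_scores[:, slot].unsqueeze(-1)
            h = torch.einsum('td,tdh->th', tokens, self.w1[idx])
            out = out + gate * torch.einsum('th,thd->td', F.relu(h), self.w2[idx])
        return out.reshape(x.shape)
```

For example, with `d_model=512`, `d_ff=2048`, and `n_experts=8`, each token activates only `k=2` experts of size 256, so the layer holds roughly the same number of parameters as the dense feedforward block while computing only a fraction of it per token.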