Approximating Two-Layer Feedforward Networks for Efficient Transformers
October 16, 2023
Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
cs.AI
Abstract
How can we reduce the compute and memory requirements of neural networks (NNs)
without sacrificing performance? Many recent works use sparse Mixtures of
Experts (MoEs) to build resource-efficient large language models (LMs). Here we
introduce several novel perspectives on MoEs, presenting a general framework
that unifies various methods to approximate two-layer NNs (e.g., feedforward
blocks of Transformers), including product-key memories (PKMs). Leveraging
insights from this framework, we propose methods to improve both MoEs and PKMs.
Unlike prior work that compares MoEs with dense baselines under the
compute-equal condition, our evaluation condition is parameter-equal, which is
crucial to properly evaluate LMs. We show that our MoEs are competitive with
the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two
different scales, while being much more resource-efficient. This demonstrates
that MoEs are relevant not only to extremely large LMs but also to any-scale
resource-efficient LMs. Our code is public.
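
The sketch below is a minimal, hypothetical illustration (not the authors' released code) of the general idea the abstract describes: a sparse top-k Mixture-of-Experts layer standing in for a Transformer's two-layer feedforward block, with the experts splitting the dense hidden size so the total parameter count stays comparable to the dense layer (the parameter-equal setting). The class name `MoEFeedForward`, the sigmoid gating, and the expert sizing are illustrative assumptions.

```python
# Hypothetical sketch: a top-k routed MoE layer approximating a dense
# two-layer feedforward block of a Transformer (parameter-equal setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        assert d_ff % n_experts == 0, "split the dense hidden size across experts"
        d_expert = d_ff // n_experts  # experts share the dense layer's parameter budget
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Expert weights as batched tensors: (n_experts, d_model, d_expert) and back.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * d_model ** -0.5)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * d_expert ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to one row per token.
        tokens = x.reshape(-1, x.shape[-1])
        scores = torch.sigmoid(self.router(tokens))        # per-expert gate scores
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                        # chosen expert for each token
            gate = topk_scores[:, slot].unsqueeze(-1)
            h = torch.einsum('td,tdh->th', tokens, self.w1[idx])
            out = out + gate * torch.einsum('th,thd->td', F.relu(h), self.w2[idx])
        return out.reshape(x.shape)
```

For example, with `d_model=512`, `d_ff=2048`, and `n_experts=8`, each token activates only `k=2` experts of size 256, so the layer holds roughly the same number of parameters as the dense feedforward block while computing only a fraction of it per token.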