Approximating Two-Layer Feedforward Networks for Efficient Transformers

Abstract

How can the compute and memory requirements of neural networks (NNs) be reduced without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwik8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to resource-efficient LMs at any scale. Our code is public.
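As a rough illustration of what approximating a two-layer feedforward block with a sparse MoE can mean, the NumPy sketch below partitions the hidden units of a dense block into equally sized groups ("experts") and evaluates only the top-k groups chosen by a linear selector. This is not the authors' implementation: the names `dense_ffn`, `moe_ffn`, and `W_sel`, the ReLU activation, and the unweighted sum over selected experts are all assumptions made for illustration.

```python
import numpy as np


def dense_ffn(x, W1, W2):
    """Dense two-layer feedforward block: y = W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def moe_ffn(x, W1, W2, W_sel, n_experts, k):
    """Sparse approximation of dense_ffn (illustrative sketch only).

    The d_ff hidden units are split into n_experts contiguous groups.
    A hypothetical linear selector W_sel scores each group, and only
    the k highest-scoring groups are evaluated and summed.
    """
    d_ff = W1.shape[0]
    assert d_ff % n_experts == 0, "hidden size must divide evenly"
    group = d_ff // n_experts

    scores = W_sel @ x                  # one score per expert
    chosen = np.argsort(scores)[-k:]    # indices of the k largest scores

    y = np.zeros(W2.shape[0])
    for e in chosen:
        rows = slice(e * group, (e + 1) * group)
        hidden = np.maximum(W1[rows] @ x, 0.0)  # this expert's hidden units
        y += W2[:, rows] @ hidden               # its contribution to the output
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts = 8, 32, 4
    x = rng.normal(size=d_model)
    W1 = rng.normal(size=(d_ff, d_model))
    W2 = rng.normal(size=(d_model, d_ff))
    W_sel = rng.normal(size=(n_experts, d_model))

    full = moe_ffn(x, W1, W2, W_sel, n_experts, k=n_experts)
    sparse = moe_ffn(x, W1, W2, W_sel, n_experts, k=1)
    print("matches dense when k == n_experts:", np.allclose(full, dense_ffn(x, W1, W2)))
    print("approximation error with k == 1:", np.linalg.norm(sparse - dense_ffn(x, W1, W2)))
```

In this sketch the sparse block reuses the dense block's weights and adds only the small selector, so the two have nearly the same parameter count while the sparse one touches k / n_experts of the hidden units per input. Selecting every expert recovers the dense output exactly, which is one way to read the "approximation" framing.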
