
Approximating Two-Layer Feedforward Networks for Efficient Transformers

October 16, 2023
Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
cs.AI

Abstract

How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. Our code is public.
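The abstract describes replacing the Transformer's dense two-layer feedforward block with a sparse Mixture of Experts evaluated under a parameter-equal budget. The following is a minimal PyTorch sketch of that general idea only; the class names, sigmoid routing, and top-k selection here are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch (assumption, not the authors' implementation) of swapping a dense
# two-layer feedforward block for a sparse Mixture-of-Experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Standard two-layer Transformer feedforward block: d_model -> d_ff -> d_model."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

class MoEFFN(nn.Module):
    """Sparse approximation: each expert is a small two-layer FFN, and only the
    top-k experts selected by a learned router are evaluated per token."""
    def __init__(self, d_model: int, d_expert: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            DenseFFN(d_model, d_expert) for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = torch.sigmoid(self.router(x))            # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens that routed expert e into their top-k set.
            rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            weight = topk_scores[rows, slots].unsqueeze(-1)
            out[rows] += weight * expert(x[rows])
        return out

# Parameter-equal comparison (the paper's evaluation condition): size the experts so
# the MoE's total parameter count matches the dense baseline, e.g. for a dense block
# with d_ff hidden units, use n_experts experts of width d_expert = d_ff // n_experts.
dense = DenseFFN(d_model=512, d_ff=2048)
moe = MoEFFN(d_model=512, d_expert=2048 // 16, n_experts=16, k=2)
```

Under this parameter-equal setup, the MoE stores roughly the same number of weights as the dense block but activates only k of the n_experts experts per token, which is what makes it cheaper at inference while remaining comparable in capacity.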