

Brainformers: Trading Simplicity for Efficiency

May 29, 2023
Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean
cs.AI

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with neural architecture search (NAS), with similar computation per token, on few-shot evaluations.
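The abstract describes the Brainformer block only at a high level; the exact sublayer permutation, expert count, and routing scheme are found by the paper's search procedure and are not given here. The sketch below is a minimal illustration of the core idea, a non-uniform block that interleaves attention, dense feed-forward, and sparsely gated mixture-of-experts feed-forward sublayers, assuming PyTorch. The class names (`DenseFFN`, `SparseMoEFFN`, `BrainformerBlock`), the top-1 routing, the particular layer ordering, and all dimensions are illustrative placeholders, not the published configuration.

```python
# Illustrative sketch only: the real Brainformer block composition is discovered
# by architecture search in the paper. Layer ordering, routing, and sizes below
# are placeholders chosen for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Standard Transformer feed-forward sublayer."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class SparseMoEFFN(nn.Module):
    """Sparsely gated mixture-of-experts feed-forward (top-1 routing for brevity)."""
    def __init__(self, d_model, d_ff, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(DenseFFN(d_model, d_ff) for _ in range(num_experts))

    def forward(self, x):
        # x: (batch, seq, d_model); route each token to its highest-scoring expert.
        scores = F.softmax(self.gate(x), dim=-1)   # (batch, seq, num_experts)
        top_w, top_idx = scores.max(dim=-1)        # (batch, seq)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out


class BrainformerBlock(nn.Module):
    """A non-uniform block mixing attention, dense FFN, and sparse MoE FFN sublayers,
    instead of the strict attention/feed-forward alternation of a vanilla Transformer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical permutation of layer primitives; the real one is searched.
        self.sublayers = nn.ModuleList([
            SparseMoEFFN(d_model, d_ff),
            DenseFFN(d_model, d_ff),
            SparseMoEFFN(d_model, d_ff),
        ])
        self.norms = nn.ModuleList(
            nn.LayerNorm(d_model) for _ in range(len(self.sublayers) + 1)
        )

    def forward(self, x):
        # Pre-norm residual self-attention sublayer.
        h = self.norms[0](x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual feed-forward sublayers applied in the chosen (here: arbitrary) order.
        for norm, layer in zip(self.norms[1:], self.sublayers):
            x = x + layer(norm(x))
        return x


if __name__ == "__main__":
    block = BrainformerBlock()
    tokens = torch.randn(2, 16, 512)
    print(block(tokens).shape)  # torch.Size([2, 16, 512])
```

Note that only the choice and ordering of sublayers differs from a standard Transformer block; each sublayer keeps the familiar pre-norm residual form.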