

Brainformers: Trading Simplicity for Efficiency

May 29, 2023
Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean
cs.AI

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with neural architecture search (NAS), with similar computation per token, on few-shot evaluations.
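The abstract describes the Brainformer block only at a high level; the exact sublayer permutation, expert count, and routing scheme are found by the paper's search procedure and are not given here. The sketch below is a minimal illustration of the core idea, a non-uniform block that interleaves attention, dense feed-forward, and sparsely gated mixture-of-experts feed-forward sublayers, assuming PyTorch. The class names (`DenseFFN`, `SparseMoEFFN`, `BrainformerBlock`), the top-1 routing, the particular layer ordering, and all dimensions are illustrative placeholders, not the published configuration.

```python
# Illustrative sketch only: the real Brainformer block composition is discovered
# by architecture search in the paper. Layer ordering, routing, and sizes below
# are placeholders chosen for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Standard Transformer feed-forward sublayer."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class SparseMoEFFN(nn.Module):
    """Sparsely gated mixture-of-experts feed-forward (top-1 routing for brevity)."""
    def __init__(self, d_model, d_ff, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(DenseFFN(d_model, d_ff) for _ in range(num_experts))

    def forward(self, x):
        # x: (batch, seq, d_model); route each token to its highest-scoring expert.
        scores = F.softmax(self.gate(x), dim=-1)   # (batch, seq, num_experts)
        top_w, top_idx = scores.max(dim=-1)        # (batch, seq)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out


class BrainformerBlock(nn.Module):
    """A non-uniform block mixing attention, dense FFN, and sparse MoE FFN sublayers,
    instead of the strict attention/feed-forward alternation of a vanilla Transformer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical permutation of layer primitives; the real one is searched.
        self.sublayers = nn.ModuleList([
            SparseMoEFFN(d_model, d_ff),
            DenseFFN(d_model, d_ff),
            SparseMoEFFN(d_model, d_ff),
        ])
        self.norms = nn.ModuleList(
            nn.LayerNorm(d_model) for _ in range(len(self.sublayers) + 1)
        )

    def forward(self, x):
        # Pre-norm residual self-attention sublayer.
        h = self.norms[0](x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual feed-forward sublayers applied in the chosen (here: arbitrary) order.
        for norm, layer in zip(self.norms[1:], self.sublayers):
            x = x + layer(norm(x))
        return x


if __name__ == "__main__":
    block = BrainformerBlock()
    tokens = torch.randn(2, 16, 512)
    print(block(tokens).shape)  # torch.Size([2, 16, 512])
```

Note that only the choice and ordering of sublayers differs from a standard Transformer block; each sublayer keeps the familiar pre-norm residual form.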