Brainformers: Trading Simplicity for Efficiency

May 29, 2023
Authors: Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean
cs.AI

Abstract

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time than its GLaM counterpart. In downstream task evaluation, Brainformer also achieves a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a dense Primer model derived with NAS with similar computation per token on few-shot evaluations.
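
The block described above mixes sparsely gated feed-forward, dense feed-forward, and attention sub-layers in a non-uniform order rather than the usual attention/FFN alternation. Below is a minimal PyTorch sketch of such a block, assuming top-1 expert gating, pre-layer-norm residual connections, and an illustrative sub-layer ordering; the actual Brainformer configuration (layer ordering, gating scheme, expert count, and other hyperparameters) is discovered by architecture search and is not reproduced here.

```python
# Minimal sketch of a non-uniform "Brainformer-style" block.
# All hyperparameters and the sub-layer ordering are illustrative assumptions,
# not the search-discovered configuration from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Standard dense feed-forward sub-layer (GELU activation assumed)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class SparseMoEFFN(nn.Module):
    """Sparsely gated mixture-of-experts feed-forward sub-layer (top-1 routing assumed)."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(DenseFFN(d_model, d_ff) for _ in range(num_experts))

    def forward(self, x):
        b, s, d = x.shape
        tokens = x.reshape(-1, d)                      # flatten tokens for routing
        probs = F.softmax(self.gate(tokens), dim=-1)   # (num_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(tokens[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape(b, s, d)


class BrainformerStyleBlock(nn.Module):
    """Non-uniform block interleaving attention, sparse MoE, and dense FFN sub-layers,
    each wrapped in a pre-layer-norm residual connection."""
    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, num_experts: int = 8):
        super().__init__()
        # Illustrative ordering only; Brainformer's ordering comes from architecture search.
        self.sublayers = nn.ModuleList([
            SparseMoEFFN(d_model, d_ff, num_experts),
            nn.MultiheadAttention(d_model, n_heads, batch_first=True),
            DenseFFN(d_model, d_ff),
            nn.MultiheadAttention(d_model, n_heads, batch_first=True),
            SparseMoEFFN(d_model, d_ff, num_experts),
        ])
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in self.sublayers)

    def forward(self, x):
        for norm, layer in zip(self.norms, self.sublayers):
            h = norm(x)
            if isinstance(layer, nn.MultiheadAttention):
                h, _ = layer(h, h, h, need_weights=False)  # self-attention
            else:
                h = layer(h)
            x = x + h  # residual connection
        return x


if __name__ == "__main__":
    block = BrainformerStyleBlock()
    y = block(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

In a full model, several such blocks would be stacked (and, in the sparse setting, the experts sharded across devices); the sketch only illustrates how heterogeneous sub-layers can be composed within one block.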