

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

December 13, 2023
Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
cs.AI

Abstract

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead - a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model. Our code is public.
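The following is a minimal PyTorch sketch of the MoE attention idea described in the abstract: the query/key projections stay dense, while each head's value and output projections are small expert banks selected per token by a sigmoid top-k gate, so far fewer heads (and hence attention matrices) are needed. Class and parameter names (`SwitchHeadAttention`, `n_experts`, `topk`) are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of MoE value/output projections for attention, assuming PyTorch.
import torch
import torch.nn as nn


class SwitchHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, d_head, n_experts=4, topk=2):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.n_experts, self.topk = n_experts, topk
        self.scale = d_head ** -0.5
        # Dense query/key projections, one per head, as in standard attention.
        self.q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Expert banks for the value and output projections of each head.
        self.v_experts = nn.Parameter(0.02 * torch.randn(n_heads, n_experts, d_model, d_head))
        self.o_experts = nn.Parameter(0.02 * torch.randn(n_heads, n_experts, d_head, d_model))
        # Token-wise gates that choose which experts each head uses.
        self.v_gate = nn.Linear(d_model, n_heads * n_experts, bias=False)
        self.o_gate = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def _gate(self, gate, x):
        # Non-competitive sigmoid gating: keep the top-k experts per head,
        # zero out the rest. Returns (B, T, H, E) mixing weights.
        B, T, _ = x.shape
        s = torch.sigmoid(gate(x)).view(B, T, self.n_heads, self.n_experts)
        top, idx = s.topk(self.topk, dim=-1)
        return torch.zeros_like(s).scatter_(-1, idx, top)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head)
        k = self.k(x).view(B, T, self.n_heads, self.d_head)

        # Values: weighted mix of the selected value experts of every head.
        vw = self._gate(self.v_gate, x)                              # (B,T,H,E)
        v_all = torch.einsum('btd,hedk->bthek', x, self.v_experts)   # (B,T,H,E,d_head)
        v = (v_all * vw.unsqueeze(-1)).sum(dim=3)                    # (B,T,H,d_head)

        # Standard causal attention over the (reduced) set of heads.
        att = torch.einsum('bqhd,bkhd->bhqk', q, k) * self.scale
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        att = att.masked_fill(causal, float('-inf')).softmax(dim=-1)
        ctx = torch.einsum('bhqk,bkhd->bqhd', att, v)                # (B,T,H,d_head)

        # Output: weighted mix of the selected output experts, summed over heads.
        ow = self._gate(self.o_gate, x)                              # (B,T,H,E)
        o_all = torch.einsum('bqhk,hekd->bqhed', ctx, self.o_experts)
        return (o_all * ow.unsqueeze(-1)).sum(dim=(2, 3))            # (B,T,d_model)


# Usage sketch: y = SwitchHeadAttention(d_model=512, n_heads=2, d_head=64)(x)
# with x of shape (batch, seq_len, 512). With, say, 2 heads instead of 8-16,
# far fewer attention matrices are materialized while the expert banks keep
# the parameter count comparable.
```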