SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

December 13, 2023
Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
cs.AI

Abstract

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead - a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model. Our code is public.
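The following is a minimal, illustrative sketch of a Mixture-of-Experts attention layer in the spirit of SwitchHead, not the authors' reference implementation (which is available in their public code). The expert count, top-k routing with sigmoid gating, and all names such as `n_experts` and `k_active` are assumptions made for readability; the key idea shown is that query/key projections (and hence attention matrices) are shared across a small number of heads, while the value and output projections are selected per token from a set of experts.

```python
# Sketch of MoE attention: shared Q/K projections, expert-mixed V/O projections.
# Assumptions (not from the paper): sigmoid gating, top-k expert selection,
# dense mixture computation for clarity rather than efficiency.
import torch
import torch.nn as nn


class MoEAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_head: int,
                 n_experts: int = 4, k_active: int = 1):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.n_experts, self.k_active = n_experts, k_active

        # Shared query/key projections: only n_heads attention matrices are
        # computed, fewer than a dense baseline with one head per expert.
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * d_head, bias=False)

        # Expert-specific value and output projections, one set per head.
        self.v_experts = nn.Parameter(
            torch.randn(n_heads, n_experts, d_model, d_head) * d_model ** -0.5)
        self.o_experts = nn.Parameter(
            torch.randn(n_heads, n_experts, d_head, d_model) * d_head ** -0.5)

        # Router producing per-token, per-head expert scores.
        self.router = nn.Linear(d_model, n_heads * n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        H, E, D = self.n_heads, self.n_experts, self.d_head

        q = self.q_proj(x).view(B, T, H, D)
        k = self.k_proj(x).view(B, T, H, D)

        # Per-token, per-head expert gates; keep only the top-k experts.
        gate = torch.sigmoid(self.router(x)).view(B, T, H, E)
        topk_val, topk_idx = gate.topk(self.k_active, dim=-1)        # (B,T,H,k)
        mask = torch.zeros_like(gate).scatter_(-1, topk_idx, topk_val)

        # Mixture of value projections (computed densely here for readability;
        # an efficient version would evaluate only the selected experts).
        v_all = torch.einsum("btm,hemd->bthed", x, self.v_experts)   # (B,T,H,E,D)
        v = (v_all * mask.unsqueeze(-1)).sum(dim=3)                  # (B,T,H,D)

        # Standard scaled dot-product attention with the reduced head count.
        attn = torch.einsum("bthd,bshd->bhts", q, k) / D ** 0.5
        attn = attn.softmax(dim=-1)
        ctx = torch.einsum("bhts,bshd->bthd", attn, v)               # (B,T,H,D)

        # Mixture of output projections, gated the same way.
        o_all = torch.einsum("bthd,hedm->bthem", ctx, self.o_experts)
        return (o_all * mask.unsqueeze(-1)).sum(dim=(2, 3))          # (B,T,d_model)


if __name__ == "__main__":
    layer = MoEAttentionSketch(d_model=256, n_heads=2, d_head=64)
    y = layer(torch.randn(1, 16, 256))
    print(y.shape)  # torch.Size([1, 16, 256])
```

Because the gating only affects which value/output experts each token uses, the number of attention matrices is tied to the small shared head count rather than to the total expert count, which is where the claimed compute and memory savings come from.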