SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Abstract

The costly self-attention layers in modern Transformers require memory and compute that grow quadratically with sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead, a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model. Our code is public.
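
To make the quadratic cost and the "4 to 8 times fewer attention matrices" claim concrete, the following minimal sketch counts the attention scores held by one layer for one sequence. It assumes one `seq_len x seq_len` matrix per head; the function name `attention_matrix_elements` and the values `seq_len = 1024` and `baseline_heads = 8` are hypothetical and chosen only for illustration. It does not model the MoE value and output projections or any measured speedup.

```python
def attention_matrix_elements(seq_len: int, n_heads: int) -> int:
    """Number of attention scores stored for one layer and one sequence.

    Each head holds one seq_len x seq_len attention matrix, so the count
    grows quadratically with seq_len and linearly with the number of heads.
    """
    return n_heads * seq_len * seq_len


# Hypothetical configuration, chosen only for illustration.
seq_len = 1024
baseline_heads = 8

baseline = attention_matrix_elements(seq_len, baseline_heads)

# Using 4x or 8x fewer attention matrices shrinks the count by the same factor.
for reduction in (4, 8):
    reduced_heads = baseline_heads // reduction
    reduced = attention_matrix_elements(seq_len, reduced_heads)
    print(f"{reduction}x fewer matrices: {baseline} -> {reduced} scores")

# Doubling the sequence length quadruples the count for a fixed head count.
doubled = attention_matrix_elements(2 * seq_len, baseline_heads)
print(f"2x sequence length: {baseline} -> {doubled} scores")
```

Because the count is linear in the number of matrices but quadratic in sequence length, reducing the number of attention matrices saves proportionally more absolute memory as sequences get longer.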
