Approximating Two-Layer Feedforward Networks for Efficient Transformers

Abstract

How can we reduce the compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, we evaluate under the parameter-equal condition, which is crucial for properly evaluating LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to resource-efficient LMs at any scale. Our code is public.
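To make the idea of approximating a two-layer feedforward block concrete, the following is a minimal sketch of a generic top-k MoE, not the specific method proposed in the paper. It assumes a ReLU activation, no biases, equally sized experts formed by slicing the hidden layer into contiguous groups, and a plain linear selector; the names `topk_moe_ffn`, `w_sel`, `n_experts`, and `k` are hypothetical and chosen for illustration only.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(m, v):
    # m is a list of rows; returns the matrix-vector product m @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def dense_ffn(x, w1, w2):
    # Dense two-layer block: y = W2 relu(W1 x).
    # w1 has shape (d_ff, d_model); w2 has shape (d_model, d_ff).
    return matvec(w2, relu(matvec(w1, x)))


def topk_moe_ffn(x, w1, w2, w_sel, n_experts, k):
    # Illustrative assumption: the d_ff hidden units are split into n_experts
    # contiguous groups, so d_ff must be divisible by n_experts.
    g = len(w1) // n_experts
    # Hypothetical linear selector: one score per expert.
    scores = matvec(w_sel, x)
    chosen = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)[:k]
    y = [0.0] * len(w2)
    for e in chosen:
        rows = list(range(e * g, (e + 1) * g))
        # Only the hidden units of the selected experts are computed.
        h = relu(matvec([w1[r] for r in rows], x))
        for i in range(len(w2)):
            y[i] += sum(w2[i][r] * h_j for r, h_j in zip(rows, h))
    return y, chosen


def random_matrix(n_rows, n_cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_ff, n_experts = 4, 8, 4
    w1 = random_matrix(d_ff, d_model, rng)
    w2 = random_matrix(d_model, d_ff, rng)
    w_sel = random_matrix(n_experts, d_model, rng)
    x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

    print("dense:", dense_ffn(x, w1, w2))
    print("top-2:", topk_moe_ffn(x, w1, w2, w_sel, n_experts, 2))
```

In this sketch the experts reuse the dense block's `w1` and `w2`, so the two models hold the same number of feedforward parameters (apart from the small selector), which mirrors a parameter-equal comparison. Only `k / n_experts` of the hidden units are evaluated per input, and selecting every expert reproduces the dense output.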
