BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Abstract

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using the experts' feed-forward networks (FFNs) to initialize the MoE's experts while merging the other parameters. However, this approach limits the reuse of dense model parameters to the FFN layers only, thereby constraining the advantages of "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFNs to initialize the MoE layers but also fully leveraging the experts' attention parameters, initializing them into a soft variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from the dense models, including all attention parameters, for the best model performance; and 2) sharing key and value parameters across all experts for better inference efficiency. To further improve efficiency, we adopt a parallel attention transformer architecture for MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance under the same computational and data constraints.
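
The initialization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter layout (the keys `ffn`, `attn_q`, `attn_k`, `attn_v`, `attn_o`, `other`), the function names `average` and `upcycle`, and the use of flat lists of floats in place of tensors are all hypothetical. It also assumes that "merging" means simple element-wise averaging and that the shared key and value parameters are obtained the same way; the abstract does not specify either.

```python
from statistics import fmean


def average(tensors):
    """Element-wise mean of equally sized flat parameter lists."""
    return [fmean(values) for values in zip(*tensors)]


def upcycle(dense_models, share_kv=False):
    """Build MoE-style parameters from a list of dense-model parameter dicts.

    Hypothetical layout: every dense model is a dict with the keys
    "ffn", "attn_q", "attn_k", "attn_v", "attn_o", and "other",
    each mapping to a flat list of floats.
    """
    def copies(key):
        # One expert per dense model, initialized from that model's weights.
        return [list(model[key]) for model in dense_models]

    def merged(key):
        # Assumption: merging is plain averaging across the dense models.
        return average([model[key] for model in dense_models])

    moe = {
        "ffn_experts": copies("ffn"),        # FFN experts, one per dense model
        "attn_q_experts": copies("attn_q"),  # attention experts: queries
        "attn_o_experts": copies("attn_o"),  # attention experts: output projections
        "other": merged("other"),            # remaining parameters are merged
    }
    if share_kv:
        # Variant 2: a single key/value set shared by all attention experts.
        # Averaging is an assumption made for this sketch.
        moe["attn_k_shared"] = merged("attn_k")
        moe["attn_v_shared"] = merged("attn_v")
    else:
        # Variant 1: each attention expert keeps all of its own parameters.
        moe["attn_k_experts"] = copies("attn_k")
        moe["attn_v_experts"] = copies("attn_v")
    return moe
```

Calling `upcycle(models)` corresponds to the first variant (separate attention experts with all attention parameters), while `upcycle(models, share_kv=True)` corresponds to the second (key and value parameters shared across experts). Routing and the parallel attention block are outside the scope of this sketch.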
