Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Abstract

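Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well suited to large-scale applications. However, each expert in existing MoE paradigms works in isolation, so high-quality interaction among experts is lacking. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes a transformer into an equivalent group of experts and then implements dynamic routing over both input data and experts. Our approach advances MoE design with four key innovations: (1) We perform equivalent expert decomposition on both MLP blocks and attention blocks, based on the matrix partitioning used in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations and optimize efficiency based on an analysis of hardware processing. Experiments demonstrate that models equipped with UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers on several tasks across the image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.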
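The idea behind innovation (1) can be illustrated with a small sketch. It is not the authors' implementation: it assumes a bias-free two-layer MLP with a ReLU activation, and every name in it (`split_into_experts`, `union_of_experts`, the `active` argument) is hypothetical. It shows only the general property that tensor-parallel partitioning relies on: because the activation is elementwise, slicing the hidden dimension yields smaller MLPs whose outputs sum to the dense output.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(W1, W2, x):
    # Dense two-layer MLP without biases: W2 @ relu(W1 @ x).
    return matvec(W2, relu(matvec(W1, x)))


def split_into_experts(W1, W2, num_experts):
    # Partition the hidden dimension: a block of rows of W1 and the
    # matching block of columns of W2 form one expert.
    hidden = len(W1)
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((W1[lo:hi], [row[lo:hi] for row in W2]))
    return experts


def union_of_experts(experts, x, active=None):
    # Sum the outputs of the selected experts. With all experts active,
    # the sum reproduces the dense MLP up to floating-point rounding.
    # `active` stands in for a routing decision (hypothetical here).
    if active is None:
        active = range(len(experts))
    out = [0.0] * len(experts[0][1])
    for e in active:
        W1_e, W2_e = experts[e]
        out = [a + b for a, b in zip(out, mlp(W1_e, W2_e, x))]
    return out


def demo(d_model=4, d_hidden=8, num_experts=4, seed=0):
    # Returns the largest absolute gap between dense and decomposed outputs.
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    x = [rng.uniform(-1, 1) for _ in range(d_model)]
    dense = mlp(W1, W2, x)
    union = union_of_experts(split_into_experts(W1, W2, num_experts), x)
    return max(abs(a - b) for a, b in zip(dense, union))
```

Calling `demo()` returns a gap at the level of floating-point rounding, which is the sense in which such a decomposition is equivalent. Passing a subset of indices as `active` evaluates only those experts, which is where a router would enter; how UoE actually selects patches and experts is not specified in the abstract and is not modeled here.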

