BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely activated architectures exhibit low chunk-level sparsity: the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly to acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), we design CLS-aware training objectives, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
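The difference between token-level and chunk-level sparsity can be illustrated with a small sketch. It is not taken from the BlockFFN code base: the function names, the toy activation pattern, and the choice of non-overlapping chunks are assumptions made purely for illustration. Each token is represented by the set of expert indices it activates, and sparsity is measured as the fraction of experts left inactive.

```python
from typing import List, Set


def token_level_sparsity(active: List[Set[int]], num_experts: int) -> float:
    """Average fraction of experts that a single token leaves inactive."""
    per_token = [1 - len(experts) / num_experts for experts in active]
    return sum(per_token) / len(per_token)


def chunk_level_sparsity(
    active: List[Set[int]], num_experts: int, chunk_size: int
) -> float:
    """Average fraction of experts left inactive by ALL tokens of a chunk.

    Assumption: chunks are consecutive and non-overlapping; a trailing
    partial chunk is ignored.
    """
    if chunk_size < 1 or chunk_size > len(active):
        raise ValueError("chunk_size must be between 1 and the sequence length")
    per_chunk = []
    for start in range(0, len(active) - chunk_size + 1, chunk_size):
        # An expert counts as active if any token in the chunk uses it.
        union = set().union(*active[start:start + chunk_size])
        per_chunk.append(1 - len(union) / num_experts)
    return sum(per_chunk) / len(per_chunk)


# Hypothetical activation pattern: 4 tokens, 8 experts, 2 experts per token.
active = [{0, 1}, {1, 2}, {0, 1}, {0, 1}]

print(token_level_sparsity(active, num_experts=8))                # 0.75
print(chunk_level_sparsity(active, num_experts=8, chunk_size=2))  # 0.6875
print(chunk_level_sparsity(active, num_experts=8, chunk_size=4))  # 0.625
```

In this toy case every token leaves 75% of the experts idle, yet a 4-token chunk leaves only 62.5% idle because the tokens do not all pick the same experts. The more the expert choices of neighboring tokens diverge, the further CLS falls below TLS, which is the gap the CLS-aware training objectives are meant to narrow.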
