

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

July 11, 2025
作者: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
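
The router described in the abstract (ReLU activation combined with RMSNorm) can be sketched roughly as below. This is a minimal illustrative sketch assuming a standard PyTorch MoE setup; the module name `ReLURMSNormRouter`, its parameters, and the normalization details are assumptions, not the paper's exact implementation (see https://github.com/thunlp/BlockFFN for the authors' code).

```python
import torch
import torch.nn as nn

class ReLURMSNormRouter(nn.Module):
    """Illustrative sketch (not the official BlockFFN code): a differentiable
    MoE router that scores experts with a linear projection followed by ReLU,
    so inactive experts receive an exact zero weight, then applies RMSNorm
    over the expert dimension to keep routing weights on a stable scale."""

    def __init__(self, hidden_size: int, num_experts: int, eps: float = 1e-6):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts, bias=False)
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(num_experts))  # RMSNorm gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        scores = torch.relu(self.proj(x))  # exact zeros give sparsity, still differentiable
        rms = scores.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        weights = scores * rms * self.scale  # RMSNorm over the expert dimension
        return weights  # (batch, seq_len, num_experts), mostly zeros
```

In such a setup, an expert whose routing weight is exactly zero for a token can be skipped entirely, which is the source of token-level sparsity; the CLS-aware training objectives and acceleration kernels described in the paper build on top of this routing behavior.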