

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

July 11, 2025
作者: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
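
The router described in the abstract (ReLU activation combined with RMSNorm) can be sketched roughly as below. This is a minimal illustrative sketch assuming a standard PyTorch MoE setup; the module name `ReLURMSNormRouter`, its parameters, and the normalization details are assumptions, not the paper's exact implementation (see https://github.com/thunlp/BlockFFN for the authors' code).

```python
import torch
import torch.nn as nn

class ReLURMSNormRouter(nn.Module):
    """Illustrative sketch (not the official BlockFFN code): a differentiable
    MoE router that scores experts with a linear projection followed by ReLU,
    so inactive experts receive an exact zero weight, then applies RMSNorm
    over the expert dimension to keep routing weights on a stable scale."""

    def __init__(self, hidden_size: int, num_experts: int, eps: float = 1e-6):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts, bias=False)
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(num_experts))  # RMSNorm gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        scores = torch.relu(self.proj(x))  # exact zeros give sparsity, still differentiable
        rms = scores.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        weights = scores * rms * self.scale  # RMSNorm over the expert dimension
        return weights  # (batch, seq_len, num_experts), mostly zeros
```

In such a setup, an expert whose routing weight is exactly zero for a token can be skipped entirely, which is the source of token-level sparsity; the CLS-aware training objectives and acceleration kernels described in the paper build on top of this routing behavior.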