BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

July 11, 2025
Authors: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, meaning that the union of multiple consecutive tokens activates a large fraction of parameters. Such a sparsity pattern is unfriendly to acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, together with efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), we design CLS-aware training objectives that make BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
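
The abstract only names the routing design and the sparsity metrics, so the sketch below is an illustrative guess at what they might look like in PyTorch: a router that applies ReLU to per-expert scores and RMS-normalizes the surviving activations, plus simple token-level and 8-token chunk-level sparsity measurements. The class and function names, tensor shapes, and exact normalization placement are assumptions made for illustration, not the released implementation; refer to the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class ReLURMSNormRouter(nn.Module):
    """Sketch of a differentiable router (assumed design, not the official one):
    a linear score per expert, ReLU to zero out inactive experts, and RMSNorm
    over the expert dimension to rescale the surviving activations, so no
    hard top-k selection is needed."""

    def __init__(self, hidden_size: int, num_experts: int, eps: float = 1e-6):
        super().__init__()
        self.score = nn.Linear(hidden_size, num_experts, bias=False)
        self.gain = nn.Parameter(torch.ones(num_experts))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, hidden_size] -> router weights: [batch, seq_len, num_experts]
        a = torch.relu(self.score(x))  # non-negative, naturally sparse expert scores
        rms = a.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return a * rms * self.gain     # RMSNorm over the expert dimension


def token_level_sparsity(router_weights: torch.Tensor) -> torch.Tensor:
    """TLS: fraction of experts not activated per token, averaged over tokens."""
    return (router_weights == 0).float().mean()


def chunk_level_sparsity(router_weights: torch.Tensor, chunk: int = 8) -> torch.Tensor:
    """CLS: fraction of experts not activated by any token within each chunk of
    `chunk` consecutive tokens (low CLS means the union of activations is large)."""
    b, t, e = router_weights.shape
    t = (t // chunk) * chunk                        # drop the trailing partial chunk
    w = router_weights[:, :t].reshape(b, t // chunk, chunk, e)
    active_in_chunk = (w != 0).any(dim=2).float()   # expert used by any token in the chunk
    return 1.0 - active_in_chunk.mean()


if __name__ == "__main__":
    router = ReLURMSNormRouter(hidden_size=512, num_experts=64)
    weights = router(torch.randn(2, 32, 512))
    print("TLS:", token_level_sparsity(weights).item())
    print("8-token CLS:", chunk_level_sparsity(weights, chunk=8).item())
```

Under this reading, TLS counts the zero entries of the router output per token, while CLS counts experts untouched by every token in a chunk; the latter is what determines how much expert weight data an end-side kernel can skip when processing a chunk of consecutive (e.g., speculatively drafted) tokens at once.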