BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

July 11, 2025
Authors: Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, meaning that the union of multiple consecutive tokens activates a large fraction of parameters. Such a sparsity pattern is unfriendly to acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, together with efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), we design CLS-aware training objectives that make BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
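
The abstract only names the routing design and the sparsity metrics, so the sketch below is an illustrative guess at what they might look like in PyTorch: a router that applies ReLU to per-expert scores and RMS-normalizes the surviving activations, plus simple token-level and 8-token chunk-level sparsity measurements. The class and function names, tensor shapes, and exact normalization placement are assumptions made for illustration, not the released implementation; refer to the linked repository for the actual code.

```python
import torch
import torch.nn as nn


class ReLURMSNormRouter(nn.Module):
    """Sketch of a differentiable router (assumed design, not the official one):
    a linear score per expert, ReLU to zero out inactive experts, and RMSNorm
    over the expert dimension to rescale the surviving activations, so no
    hard top-k selection is needed."""

    def __init__(self, hidden_size: int, num_experts: int, eps: float = 1e-6):
        super().__init__()
        self.score = nn.Linear(hidden_size, num_experts, bias=False)
        self.gain = nn.Parameter(torch.ones(num_experts))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, hidden_size] -> router weights: [batch, seq_len, num_experts]
        a = torch.relu(self.score(x))  # non-negative, naturally sparse expert scores
        rms = a.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return a * rms * self.gain     # RMSNorm over the expert dimension


def token_level_sparsity(router_weights: torch.Tensor) -> torch.Tensor:
    """TLS: fraction of experts not activated per token, averaged over tokens."""
    return (router_weights == 0).float().mean()


def chunk_level_sparsity(router_weights: torch.Tensor, chunk: int = 8) -> torch.Tensor:
    """CLS: fraction of experts not activated by any token within each chunk of
    `chunk` consecutive tokens (low CLS means the union of activations is large)."""
    b, t, e = router_weights.shape
    t = (t // chunk) * chunk                        # drop the trailing partial chunk
    w = router_weights[:, :t].reshape(b, t // chunk, chunk, e)
    active_in_chunk = (w != 0).any(dim=2).float()   # expert used by any token in the chunk
    return 1.0 - active_in_chunk.mean()


if __name__ == "__main__":
    router = ReLURMSNormRouter(hidden_size=512, num_experts=64)
    weights = router(torch.randn(2, 32, 512))
    print("TLS:", token_level_sparsity(weights).item())
    print("8-token CLS:", chunk_level_sparsity(weights, chunk=8).item())
```

Under this reading, TLS counts the zero entries of the router output per token, while CLS counts experts untouched by every token in a chunk; the latter is what determines how much expert weight data an end-side kernel can skip when processing a chunk of consecutive (e.g., speculatively drafted) tokens at once.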