BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Abstract

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely activated architectures exhibit low chunk-level sparsity: the union of the parameters activated by multiple consecutive tokens covers a large fraction of the model. Such a sparsity pattern is ill-suited to acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), we design CLS-aware training objectives, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
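The distinction between token-level and chunk-level sparsity can be made concrete with a small sketch. The code below is an illustration only, not the authors' evaluation code: the function names, the boolean token-by-expert mask, and the use of non-overlapping windows are all assumptions introduced here. It treats an expert as active within a chunk if any token in that chunk activates it, which follows the "union of multiple consecutive tokens" wording of the abstract.

```python
from typing import List, Sequence


def token_level_sparsity(mask: Sequence[Sequence[bool]]) -> float:
    """Average fraction of experts left inactive by each single token."""
    num_experts = len(mask[0])
    inactive = sum(num_experts - sum(row) for row in mask)
    return inactive / (len(mask) * num_experts)


def chunk_level_sparsity(mask: Sequence[Sequence[bool]], chunk: int) -> float:
    """Average fraction of experts that no token in a chunk activates.

    Chunks are consecutive, non-overlapping windows of `chunk` tokens
    (an assumption of this sketch); a trailing partial window is dropped.
    """
    num_experts = len(mask[0])
    ratios: List[float] = []
    for start in range(0, len(mask) - chunk + 1, chunk):
        window = mask[start:start + chunk]
        # An expert counts as active if any token in the window uses it.
        union_active = sum(
            any(row[e] for row in window) for e in range(num_experts)
        )
        ratios.append(1 - union_active / num_experts)
    return sum(ratios) / len(ratios)


# Toy mask (hypothetical data): 4 tokens, 4 experts, one expert per token,
# with every token choosing a different expert.
scattered = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, True],
]
# Same per-token budget, but consecutive tokens reuse the same expert.
clustered = [
    [True, False, False, False],
    [True, False, False, False],
    [False, True, False, False],
    [False, True, False, False],
]

print(token_level_sparsity(scattered), token_level_sparsity(clustered))
print(chunk_level_sparsity(scattered, 4), chunk_level_sparsity(clustered, 4))
```

Both toy masks have the same TLS (each token leaves three of four experts idle), yet over a 4-token chunk the scattered pattern touches every expert while the clustered one leaves half of them untouched. This is the gap that a CLS-aware objective is meant to close: high TLS alone does not guarantee that a batch of consecutive tokens, as processed in speculative decoding, can skip many parameters.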
