Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Abstract

Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. We further show that such SSM pruning is necessary to improve accuracy and inference speed over traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
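
The abstract does not spell out how the group-aware pruning works, so the following is only a minimal sketch of one plausible reading: SSM heads are assumed to be partitioned into equally sized, contiguous groups, and pruning keeps the same number of heads in every group so that the group layout stays intact. The function names, the contiguous layout, and the per-head importance scores are all hypothetical and are not taken from the paper.

```python
from typing import List


def group_aware_head_selection(
    scores: List[float], num_groups: int, keep_per_group: int
) -> List[int]:
    """Return sorted indices of heads to keep, the same number from every group.

    Assumption (not from the paper): heads are laid out contiguously by
    group, so head i belongs to group i // heads_per_group.
    """
    if num_groups <= 0 or len(scores) % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    heads_per_group = len(scores) // num_groups
    if not 0 < keep_per_group <= heads_per_group:
        raise ValueError("keep_per_group must be in 1..heads_per_group")

    kept: List[int] = []
    for g in range(num_groups):
        start = g * heads_per_group
        group = range(start, start + heads_per_group)
        # Rank heads only against other heads in the same group.
        top = sorted(group, key=lambda i: scores[i], reverse=True)[:keep_per_group]
        # Restore the original order so the group layout is preserved.
        kept.extend(sorted(top))
    return kept


def global_head_selection(scores: List[float], keep_total: int) -> List[int]:
    """Group-unaware baseline: keep the highest-scoring heads overall."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep_total])


# Hypothetical importance scores for 8 heads in 2 groups of 4.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4]
group_aware = group_aware_head_selection(example_scores, num_groups=2, keep_per_group=2)
group_unaware = global_head_selection(example_scores, keep_total=4)
```

With these made-up scores, the group-aware selection keeps two heads from each group, while the global baseline keeps only heads from the first group and empties the second. That imbalance illustrates the kind of structural damage a group-aware strategy is meant to avoid.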
