Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Abstract

State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or resource-limited devices is challenging because of their storage and compute demands. Quantizing SSMs to low bit-width data formats addresses this by reducing model size and allowing the model to benefit from hardware acceleration. Because SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, different scenarios call for different bit-width configurations, such as W4A8 for boosting large-batch decoding speed and W4A16 for enhancing generation speed in short-prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel-order preservation and activation persistence of SSMs, we propose an offline approach to quantize the inputs of a linear recurrence to 8 bits by sorting and clustering for input x, combined with per-state-group quantization for the input-dependent parameters B and C. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. Experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3× and 3× speed-ups in the pre-filling and generation stages, respectively, while offering a 4× memory reduction with only a 1.6% average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
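The sort-and-cluster idea and the offline weight rearrangement can be illustrated with a small NumPy sketch. This is not the authors' implementation: the layer names `W_in` and `W_out`, the equal-size grouping, the symmetric int8 fake-quantization, and all sizes are assumptions chosen for illustration. It shows two things under those assumptions: permuting the producer's output rows and the consumer's input columns by the same order leaves the result unchanged, and per-group scales over sorted channels give a lower quantization error than a single per-tensor scale.

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric fake-quantization: round to the int8 grid, then map back to float."""
    return np.clip(np.round(x / scale), -127, 127) * scale

def sort_and_cluster(channel_max, n_groups):
    """Sort channels by calibrated max magnitude and split them into equal-size groups."""
    order = np.argsort(channel_max)                      # channel indices, small to large
    groups = np.array_split(np.arange(len(order)), n_groups)  # positions in sorted order
    return order, groups

rng = np.random.default_rng(0)
d_in, d, d_out, n_tokens = 16, 64, 8, 256

# Hypothetical layers: W_in produces the activation x, W_out consumes it.
W_in = rng.normal(size=(d, d_in))
W_in *= np.logspace(-1, 1, d)[rng.permutation(d)][:, None]  # channels with very different ranges
W_out = rng.normal(size=(d_out, d))

# Offline calibration: record the per-channel max magnitude of x.
x_calib = rng.normal(size=(n_tokens, d_in)) @ W_in.T
channel_max = np.abs(x_calib).max(axis=0)
order, groups = sort_and_cluster(channel_max, n_groups=4)

# Offline rearrangement: same permutation on producer rows and consumer columns.
W_in_perm = W_in[order]
W_out_perm = W_out[:, order]

inp = rng.normal(size=(n_tokens, d_in))                  # fresh inputs, not the calibration set
y_ref = (inp @ W_in.T) @ W_out.T
y_perm = (inp @ W_in_perm.T) @ W_out_perm.T              # identical up to float rounding

# Quantize x with one per-tensor scale versus one scale per sorted group.
x = inp @ W_in_perm.T                                    # channels already in sorted order
x_q_tensor = quantize_int8(x, channel_max.max() / 127)

sorted_max = channel_max[order]
x_q_group = np.empty_like(x)
for g in groups:
    x_q_group[:, g] = quantize_int8(x[:, g], sorted_max[g].max() / 127)

err_tensor = np.mean((x - x_q_tensor) ** 2)
err_group = np.mean((x - x_q_group) ** 2)
print(f"per-tensor MSE: {err_tensor:.5f}, per-group MSE: {err_group:.5f}")
```

Because the groups are contiguous after sorting, each group needs only one scale, and the permutation costs nothing at run time since it is folded into the weights ahead of time.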
