Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
March 28, 2025
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
cs.AI
Abstract
State Space Models (SSMs) are emerging as a compelling alternative to
Transformers because of their consistent memory usage and high performance.
Despite this, scaling up SSMs on cloud services or limited-resource devices is
challenging due to their storage requirements and computational power. To
overcome this, quantizing SSMs with low bit-width data formats can reduce model
size and benefit from hardware acceleration. As SSMs are prone to
quantization-induced errors, recent efforts have focused on optimizing a
particular model or bit-width for efficiency without sacrificing performance.
However, distinct bit-width configurations are essential for different
scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for
enhancing generation speed in short prompt applications for a single user. To
this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both
Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment
on various platforms. Based on the channel order preserving and activation
persistence of SSMs, we propose an offline approach to quantize inputs of a
linear recurrence in 8-bit by sorting and clustering for input x, combined
with a per-state-group quantization for input-dependent parameters B and C.
To ensure compute-invariance in the SSM output, we rearrange weights offline
according to the clustering sequence. The experiments show that Quamba2-8B
outperforms several state-of-the-art SSM quantization methods and delivers
1.3x and 3x speed-ups in the pre-filling and generation stages,
respectively, while offering 4x memory reduction with only a 1.6%
average accuracy drop. The evaluation on MMLU shows the generalizability and
robustness of our framework. The code and quantized models will be released at:
https://github.com/enyac-group/Quamba.
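
The abstract's core idea can be pictured concretely: sort the channels of the SSM input x by their calibrated ranges, cluster the sorted channels into groups that each share one 8-bit scale, and permute the projection weights that produce x offline so that the reordering costs nothing at runtime. The NumPy sketch below is only an illustration of that idea under simple assumptions (per-channel absolute maxima from a calibration set, contiguous clusters, symmetric int8 with a 127 scale); the function names and shapes are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def sort_and_cluster_scales(channel_maxima: np.ndarray, n_groups: int):
    """Sort channels by calibrated absolute maxima and split the sorted order
    into contiguous clusters, each sharing one symmetric int8 scale."""
    order = np.argsort(channel_maxima)                    # channel reordering
    groups = np.array_split(order, n_groups)              # contiguous clusters
    scales = np.array([channel_maxima[g].max() / 127.0 for g in groups])
    return order, groups, scales

def quantize_per_group(x: np.ndarray, groups, scales):
    """Quantize activations to int8 with one shared scale per channel cluster."""
    x_q = np.empty(x.shape, dtype=np.int8)
    for g, s in zip(groups, scales):
        x_q[..., g] = np.clip(np.round(x[..., g] / s), -128, 127).astype(np.int8)
    return x_q

def reorder_weights_offline(w_out_by_in: np.ndarray, order: np.ndarray):
    """Permute the output rows of the projection that produces x, so the sorted
    channel layout is materialized for free at runtime (the output is unchanged
    up to a fixed permutation that downstream weights can absorb offline)."""
    return w_out_by_in[order, :]

# Toy usage with random calibration data (shapes are hypothetical).
rng = np.random.default_rng(0)
x_calib = rng.standard_normal((256, 1024)).astype(np.float32)
channel_maxima = np.abs(x_calib).max(axis=0)
order, groups, scales = sort_and_cluster_scales(channel_maxima, n_groups=8)
x_q = quantize_per_group(x_calib, groups, scales)   # int8, one scale per cluster
```

The same grouping principle is what the abstract refers to as per-state-group quantization of the input-dependent parameters B and C, with the offline weight rearrangement guaranteeing that the SSM output is computed on an equivalent, merely permuted, channel order.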