Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Abstract

State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or resource-limited devices is challenging due to their storage requirements and computational demands. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, such as W4A8 for boosting large-batch decoding speed and W4A16 for enhancing generation speed in short-prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preservation and activation persistence of SSMs, we propose an offline approach to quantize the inputs of a linear recurrence in 8-bit by sorting and clustering for input x, combined with a per-state-group quantization for input-dependent parameters B and C. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3× and 3× speed-ups in the pre-filling and generation stages, respectively, while offering 4× memory reduction with only a 1.6% average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
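The sort-and-cluster idea and the offline weight rearrangement described in the abstract can be illustrated with a toy sketch. This is not the Quamba2 implementation. The function names, the calibration maxima, the equal-size grouping after sorting, and the symmetric int8 scheme are all assumptions chosen for illustration; the released code may cluster channels and compute scales differently.

```python
# Illustrative sketch only; not the Quamba2 implementation.
# Assumptions: per-channel calibration maxima are given, channels are split
# into equal-size groups after sorting, and quantization is symmetric int8.

def sort_and_cluster(channel_max, num_groups):
    """Sort channels by calibrated max magnitude and split them into groups.

    Returns the channel permutation and one int8 scale per group.
    """
    perm = sorted(range(len(channel_max)), key=lambda c: channel_max[c])
    group_size = len(perm) // num_groups  # assumes an even split
    scales = []
    for g in range(num_groups):
        members = perm[g * group_size:(g + 1) * group_size]
        scales.append(max(channel_max[c] for c in members) / 127.0)
    return perm, scales


def quantize_int8(values, scale):
    """Symmetric 8-bit quantization with clamping to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Toy calibration statistics: two small-range and two large-range channels.
channel_max = [8.0, 0.5, 0.4, 10.0]
perm, scales = sort_and_cluster(channel_max, num_groups=2)

# Offline weight rearrangement: permute the rows of the layer that produces x
# and the columns of the layer that consumes it, so the output is unchanged.
w_in = [[4.0, 0.0], [0.0, 0.25], [0.125, 0.125], [2.0, -1.0]]  # 4 channels x 2 inputs
w_out = [[1.0, 2.0, 3.0, 4.0]]                                 # 1 output x 4 channels
w_in_perm = [w_in[c] for c in perm]
w_out_perm = [[row[c] for c in perm] for row in w_out]

u = [1.0, -2.0]
x_perm = matvec(w_in_perm, u)  # channels now arrive already grouped by range
y_original = matvec(w_out, matvec(w_in, u))[0]
y_permuted = matvec(w_out_perm, x_perm)[0]

# Quantize each contiguous group of the permuted input with its own scale.
group_size = len(perm) // len(scales)
x_q = [
    quantize_int8(x_perm[g * group_size:(g + 1) * group_size], scales[g])
    for g in range(len(scales))
]
```

In this toy setup the small-range group is quantized with a scale of 0.5/127 instead of the 10/127 that a single per-tensor scale would impose, so its channels keep 20 times finer resolution, while `y_original` and `y_permuted` agree because the same permutation is applied on both sides of x.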
