Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

Abstract

Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that, for the first time, jointly applies 1.58-bit quantization and dynamic N:M sparsification while keeping training stable. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently shows smaller performance degradation than full-precision baselines at the same sparsity levels and tolerates higher structured sparsity before accuracy collapses. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups of up to 1.30× in both training and inference. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code is available at https://github.com/AAzdi/Sparse-BitNet.
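To make the two ingredients concrete, the following is a minimal sketch of how an N:M mask and ternary (1.58-bit) weights can be combined on a single weight row. It is an illustration under stated assumptions, not the Sparse-BitNet implementation: the abstract does not specify the masking criterion, the quantizer, or the order of operations. Here the mask keeps the N largest-magnitude weights in each group of M, the quantizer is the absmean rounding scheme commonly associated with BitNet b1.58, and the mask is computed on the latent full-precision weights before quantization. All function names (`nm_mask`, `ternary_quantize`, `sparse_ternary`) are hypothetical.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude entries per group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| in this group; sorted() is stable,
        # so ties are broken in favor of the earlier position.
        keep = sorted(range(m), key=lambda i: -abs(group[i]))[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization: divide by mean |w|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def sparse_ternary(weights, n=2, m=4):
    """Assumed ordering: mask from latent magnitudes, quantize, then apply the mask."""
    mask = nm_mask(weights, n, m)
    codes, scale = ternary_quantize(weights)
    return [c * k for c, k in zip(codes, mask)], scale, mask


# Toy weight row with two groups of four (values are made up for illustration).
row = [0.9, -0.05, 0.4, -1.2, 0.02, 0.7, -0.6, 0.1]
codes, scale, mask = sparse_ternary(row)
```

In a "dynamic" scheme, a mask like this would presumably be recomputed from the latent weights as training proceeds rather than fixed once. Note that ternary quantization already maps small weights to zero, so some groups may contain more zeros than the N:M pattern requires.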
