StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing

Abstract

We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram context mixing and arithmetic coding. The model is initialised from scratch and trained token by token on the file being compressed, so it requires no pre-trained weights, no GPU, and no external dependencies. The SSM (DM=32, NL=2, approximately 120K active parameters per file) provides a continuously updated probability estimate over BPE tokens, while nine sparse n-gram hash tables (bigram through 32-gram, 16M slots each) add exact memorisation of local and long-range patterns via a softmax-invariant logit-bias mechanism that updates only tokens with non-zero counts. An entropy-adaptive scaling mechanism modulates the n-gram contribution according to the SSM's predictive confidence, preventing over-correction when the neural model is already well calibrated. On the standard enwik8 benchmark, StateSMix achieves 2.123 bpb on 1 MB, 2.149 bpb on 3 MB, and 2.162 bpb on 10 MB, beating xz -9e (LZMA2) by 8.7%, 5.4%, and 0.7%, respectively. Ablation experiments establish the SSM as the dominant compression engine: on its own it accounts for a 46.6% size reduction over a frequency-count baseline and beats xz without any n-gram component, while the n-gram tables provide a complementary 4.1% gain through exact context memorisation. OpenMP parallelisation of the training loop yields a 1.9x speedup on 4 cores. The system is implemented in pure C with AVX2 SIMD and processes approximately 2,000 tokens per second on commodity x86-64 hardware.
