StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing

Abstract

We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram context mixing and arithmetic coding. The model is initialised from scratch and trained token-by-token on the file being compressed, requiring no pre-trained weights, no GPU, and no external dependencies. The SSM (DM=32, NL=2, approximately 120K active parameters per file) provides a continuously updated probability estimate over BPE tokens. Nine sparse n-gram hash tables (bigram through 32-gram, 16M slots each) add exact local and long-range pattern memorisation via a softmax-invariant logit-bias mechanism that updates only tokens with non-zero counts. An entropy-adaptive scaling mechanism modulates the n-gram contribution based on the SSM's predictive confidence, preventing over-correction when the neural model is already well calibrated. On the standard enwik8 benchmark, StateSMix achieves 2.123 bits per byte (bpb) on 1 MB, 2.149 bpb on 3 MB, and 2.162 bpb on 10 MB, beating xz -9e (LZMA2) by 8.7%, 5.4%, and 0.7% respectively. Ablation experiments establish the SSM as the dominant compression engine: on its own it achieves a 46.6% size reduction relative to a frequency-count baseline and beats xz without any n-gram component, while the n-gram tables provide a complementary 4.1% gain through exact context memorisation. OpenMP parallelisation of the training loop yields a 1.9x speedup on 4 cores. The system is implemented in pure C with AVX2 SIMD and processes approximately 2,000 tokens per second on commodity x86-64 hardware.
