StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing
April 5, 2026
Author: Roberto Tacconelli
cs.AI
Abstract
We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram context mixing and arithmetic coding. The model is initialised from scratch and trained token-by-token on the file being compressed, requiring no pre-trained weights, no GPU, and no external dependencies. The SSM (DM=32, NL=2, approximately 120K active parameters per file) provides a continuously updated probability estimate over BPE tokens, while nine sparse n-gram hash tables (bigram through 32-gram, 16M slots each) add exact local and long-range pattern memorisation via a softmax-invariant logit-bias mechanism that updates only non-zero-count tokens. An entropy-adaptive scaling mechanism modulates the n-gram contribution based on the SSM's predictive confidence, preventing over-correction when the neural model is already well-calibrated. On the standard enwik8 benchmark, StateSMix achieves 2.123 bpb on 1 MB, 2.149 bpb on 3 MB, and 2.162 bpb on 10 MB, beating xz -9e (LZMA2) by 8.7%, 5.4%, and 0.7% respectively. Ablation experiments establish the SSM as the dominant compression engine: it alone accounts for a 46.6% size reduction over a frequency-count baseline and beats xz without any n-gram component, while n-gram tables provide a complementary 4.1% gain through exact context memorisation. OpenMP parallelisation of the training loop yields a 1.9x speedup on 4 cores. The system is implemented in pure C with AVX2 SIMD and processes approximately 2,000 tokens per second on commodity x86-64 hardware.
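To make the mixing step described above concrete, the following C sketch shows one way the n-gram logit bias and entropy-adaptive scaling could fit together. It is a minimal illustration, not the paper's implementation: the names (softmax_entropy, mix_ngram_bias, VOCAB), the log1p-of-count bias form, and the linear entropy-based scale are all assumptions made here for clarity.

/* Sketch of one context-mixing step, assuming a flat per-token count array
 * gathered from the matched n-gram hash tables. Compile with -lm. */
#include <math.h>
#include <stddef.h>

#define VOCAB 4096  /* assumed BPE vocabulary size, for illustration only */

/* Softmax over logits into probs; returns the Shannon entropy in bits. */
static double softmax_entropy(const float *logits, double *probs, size_t n)
{
    double maxl = logits[0];
    for (size_t i = 1; i < n; i++)
        if (logits[i] > maxl) maxl = logits[i];

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        probs[i] = exp(logits[i] - maxl);
        sum += probs[i];
    }

    double H = 0.0;
    for (size_t i = 0; i < n; i++) {
        probs[i] /= sum;
        if (probs[i] > 0.0)
            H -= probs[i] * log2(probs[i]);
    }
    return H;
}

/* Mix SSM logits with sparse n-gram counts for one prediction step.
 * counts[i] > 0 only for tokens actually seen in the matched context, so the
 * bias leaves every other logit untouched (the "softmax-invariant" idea:
 * relative odds among unbiased tokens are preserved). The entropy-adaptive
 * scale shrinks the correction when the SSM is already confident. */
void mix_ngram_bias(const float *ssm_logits,   /* [VOCAB] from the SSM        */
                    const unsigned *counts,    /* [VOCAB] sparse n-gram hits  */
                    float *mixed_logits,       /* [VOCAB] output              */
                    double *probs)             /* [VOCAB] scratch/output      */
{
    double H    = softmax_entropy(ssm_logits, probs, VOCAB);
    double Hmax = log2((double)VOCAB);

    /* Assumed schedule: full n-gram weight at maximum entropy, near zero when
     * the SSM distribution is almost deterministic. The paper's actual
     * scaling function may differ. */
    double scale = H / Hmax;

    for (size_t i = 0; i < VOCAB; i++) {
        float bias = 0.0f;
        if (counts[i] > 0)   /* only non-zero-count tokens receive a bias */
            bias = (float)(scale * log1p((double)counts[i]));
        mixed_logits[i] = ssm_logits[i] + bias;
    }
    /* mixed_logits would then be re-softmaxed and handed to the arithmetic
     * coder as the coding distribution for the next token. */
}

Under these assumptions, the same routine serves both compression and decompression, since the decoder reconstructs identical SSM states and n-gram counts token by token and therefore derives the same coding distribution.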