Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding

Abstract

We present Nacrith, a lossless compression system that combines a 135M-parameter transformer language model (SmolLM2-135M) with an ensemble of lightweight online predictors and a 32-bit arithmetic coder. Beyond the base LLM-plus-arithmetic-coding paradigm, Nacrith makes eight contributions: (1) a CDF precision upgrade from 2^16 to 2^24 that eliminates ~75% of the quantization overhead caused by minimum-probability floors in large vocabularies; (2) a token-level N-gram model for fast local predictions; (3) an adaptive log-space bias head that corrects per-document LLM errors via online gradient descent; (4) confidence-based LLM skipping that accelerates highly predictable tokens; (5) a hybrid binary format (NC06) that extends neural compression to arbitrary binary files, to our knowledge a first among LLM-based compressors; (6) a llama.cpp inference backend with ~7x faster single-token decoding than PyTorch; (7) parallel multi-GPU compression across up to 8 workers; and (8) a native KV-cache sliding window that reduces per-slide cost by ~37x. The system requires only ~500 MB of GGUF weights and ~1.2 GB of VRAM per worker, and runs on consumer GPUs. On alice29.txt (Canterbury Corpus, 152 KB), Nacrith achieves 0.918 bits per byte (bpb), outperforming gzip by 3.1x, bzip2 by 2.5x, CMIX v21 by 44%, and ts_zip by 20%, while compressing below the 0th-, 1st-, and 2nd-order byte-level Shannon entropy bounds. On enwik8 (100 MB), Nacrith achieves 0.9389 bpb (11.74% of the original size), surpassing ts_zip (~1.11 bpb) by 15% and FineZip (1.024 bpb) by 8% despite using a 60x smaller model with no fine-tuning. An out-of-distribution evaluation on a document published after the model's training cutoff confirms that these gains are not memorization artifacts: Nacrith achieves 0.723 bpb on this unseen text.
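
To make the role of CDF precision concrete, the following is a minimal sketch of integer CDF quantization with a minimum count of one per token. It is not Nacrith's actual implementation. The function names, the rounding rule, the peaked test distribution, and the vocabulary size of 49,152 tokens are all assumptions chosen for illustration. The sketch measures how many extra bits the floors cost for one confidently predicted token at 16-bit and at 24-bit precision.

```python
import math

# Assumed vocabulary size for illustration only (not stated in the abstract).
VOCAB_SIZE = 49_152


def quantize_counts(probs, precision_bits):
    """Map probabilities to integer counts that sum to 2**precision_bits.

    Every symbol receives at least one count (the minimum-probability floor),
    so the arithmetic coder can always encode it.
    """
    total = 1 << precision_bits
    n = len(probs)
    if n > total:
        raise ValueError("precision too low to give every symbol a count")
    budget = total - n  # mass left after reserving one count per symbol
    counts = [1 + int(p * budget) for p in probs]
    # Hand the rounding leftovers to the most probable symbol.
    top = max(range(n), key=probs.__getitem__)
    counts[top] += total - sum(counts)
    return counts


def overhead_bits(probs, counts, symbol):
    """Extra bits paid for `symbol` relative to its ideal code length."""
    ideal = -math.log2(probs[symbol])
    actual = -math.log2(counts[symbol] / sum(counts))
    return actual - ideal


def peaked_distribution(n, top_prob):
    """One token holds `top_prob`; the rest share the remainder uniformly."""
    rest = (1.0 - top_prob) / (n - 1)
    return [top_prob] + [rest] * (n - 1)


probs = peaked_distribution(VOCAB_SIZE, 0.99)
for bits in (16, 24):
    counts = quantize_counts(probs, bits)
    floor_share = VOCAB_SIZE / (1 << bits)
    print(f"{bits}-bit: floors reserve {floor_share:.2%} of the range, "
          f"overhead on the top token = {overhead_bits(probs, counts, 0):.4f} bits")
```

Under these assumptions, the floors alone reserve three quarters of a 2^16 range, so a token the model predicts with probability 0.99 costs roughly two bits instead of about 0.015. At 2^24 the floors take well under 1% of the range and the overhead becomes negligible.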
