Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation

Abstract

We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer for improving probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When contexts have been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively small numbers of observations. The denoising process is applied in multiple successive steps, allowing each stage to refine residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage applied after all model predictions have been combined, allowing it to correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
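
As an illustration of the bitwise-tree decomposition described in the abstract, the following is a minimal sketch, not the Midicoth implementation. It splits a 256-way byte distribution into binary probabilities, one per bit prefix, and rebuilds the byte distribution from them. The function names (`byte_to_bit_path`, `bit_path_to_byte`, `symbol_prob`), the most-significant-bit-first ordering, and the 0.5 convention for zero-mass prefixes are assumptions made for this example. The demo at the end models prior smoothing as mixing with a uniform distribution, which is also an assumption; it is used only to show that a flattened byte distribution pulls every binary node toward 0.5, which is the kind of bias a per-node calibration step would have to undo.

```python
import numpy as np


def byte_to_bit_path(p):
    """Decompose a 256-way distribution into binary node probabilities.

    Returns 8 arrays; levels[k] has 2**k entries, one per k-bit prefix
    (most significant bit first), holding P(next bit = 1 | prefix).
    """
    p = np.asarray(p, dtype=np.float64)
    assert p.shape == (256,)
    levels = []
    mass = p
    # Work bottom-up: adjacent entries differ only in the lowest remaining bit.
    for _ in range(8):
        pairs = mass.reshape(-1, 2)
        total = pairs.sum(axis=1)
        # A prefix with zero mass has no defined split; use 0.5 by convention.
        p1 = np.divide(pairs[:, 1], total,
                       out=np.full_like(total, 0.5), where=total > 0)
        levels.append(p1)
        mass = total
    levels.reverse()  # levels[0] is the root, levels[7] the last bit
    return levels


def bit_path_to_byte(levels):
    """Rebuild the 256-way distribution from the binary node probabilities."""
    p = np.ones(1)
    for p1 in levels:
        # Each prefix splits into (prefix + bit 0, prefix + bit 1).
        p = np.stack([p * (1.0 - p1), p * p1], axis=1).reshape(-1)
    return p


def symbol_prob(levels, byte):
    """Probability of one byte as a product of 8 binary decisions."""
    prob, prefix = 1.0, 0
    for k in range(8):
        bit = (byte >> (7 - k)) & 1
        p1 = levels[k][prefix]
        prob *= p1 if bit else 1.0 - p1
        prefix = (prefix << 1) | bit
    return prob


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_p = rng.dirichlet(np.full(256, 0.05))   # a peaked source distribution
    smoothed = 0.7 * true_p + 0.3 / 256          # assumed uniform-prior shrinkage
    sharp, flat = byte_to_bit_path(true_p), byte_to_bit_path(smoothed)
    for k in range(8):
        # Mean distance of node probabilities from 0.5, before and after smoothing.
        print(k, np.abs(sharp[k] - 0.5).mean(), np.abs(flat[k] - 0.5).mean())
```

Because each node is a single binary probability, a correction can be estimated per node (or per group of nodes) from far fewer observations than a full 256-way calibration table would need. A correction applied to the arrays in `levels` before calling `bit_path_to_byte` always yields a valid distribution, provided each corrected value stays within [0, 1].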
