Micro-Diffusion Compression: Binary Tree Tweedie Denoising for Online Probability Estimation

Abstract

We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer for improving probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When contexts have been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively small numbers of observations. The denoising process is applied in multiple successive steps, allowing each stage to refine residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage applied after all model predictions have been combined, allowing it to correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
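
The bitwise decomposition described above can be illustrated with a minimal Python sketch. It is not Midicoth's implementation: the abstract does not give the correction rule, so the function names, the heap-style node indexing, the 16-bin table, and the additive correction are all assumptions chosen for illustration. The sketch shows how a 256-way byte distribution maps to 255 binary node probabilities and back, and how a small per-bin table of observed outcomes could nudge a flattened binary prediction toward its empirical frequency.

```python
def byte_to_bit_probs(probs):
    """Map a 256-way distribution to P(bit = 1 | prefix) for each tree node.

    Assumed layout: heap indexing, root = 1, children of n are 2n and 2n + 1,
    bits consumed most significant first, so symbol s sits at leaf 256 + s.
    """
    assert len(probs) == 256
    mass = [0.0] * 256 + list(probs)          # leaves occupy 256..511
    for n in range(255, 0, -1):               # accumulate subtree mass bottom-up
        mass[n] = mass[2 * n] + mass[2 * n + 1]
    p1 = [0.5] * 256                          # index 0 unused
    for n in range(1, 256):
        if mass[n] > 0.0:
            p1[n] = mass[2 * n + 1] / mass[n]
    return p1


def bit_probs_to_byte(p1):
    """Rebuild the 256-way distribution as a product of 8 binary decisions."""
    probs = []
    for sym in range(256):
        node, p = 1, 1.0
        for shift in range(7, -1, -1):
            bit = (sym >> shift) & 1
            p *= p1[node] if bit else 1.0 - p1[node]
            node = 2 * node + bit
        probs.append(p)
    return probs


class BinnedCalibrator:
    """Hypothetical binary calibrator: one table of running sums per
    predicted-probability bin. The additive rule is an assumption, not
    the correction used in the paper."""

    def __init__(self, bins=16, min_count=8):
        self.bins = bins
        self.min_count = min_count
        self.sum_pred = [0.0] * bins
        self.sum_bit = [0.0] * bins
        self.count = [0] * bins

    def _bin(self, p):
        return min(int(p * self.bins), self.bins - 1)

    def correct(self, p, strength=1.0):
        b = self._bin(p)
        if self.count[b] < self.min_count:    # too few observations: leave p alone
            return p
        gap = (self.sum_bit[b] - self.sum_pred[b]) / self.count[b]
        return min(max(p + strength * gap, 1e-6), 1.0 - 1e-6)

    def update(self, p, bit):
        b = self._bin(p)
        self.sum_pred[b] += p
        self.sum_bit[b] += bit
        self.count[b] += 1
```

In an online coder of this shape, each of the eight binary decisions for a byte would be corrected before coding and the calibrator updated with the actual bit afterwards. Running several such calibrators in sequence, each fed the output of the previous one, would loosely correspond to the multi-step refinement the abstract describes.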