Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation
March 9, 2026
Author: Roberto Tacconelli
cs.AI
Abstract
We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer for improving probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When contexts have been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively small numbers of observations. The denoising process is applied in multiple successive steps, allowing each stage to refine residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage applied after all model predictions have been combined, allowing it to correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
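The bitwise-tree decomposition and the per-node calibration described above can be illustrated with a minimal sketch. It is not the Midicoth implementation: the names `tree_probs`, `byte_prob`, `BinaryCalibrator`, `denoise`, and `learn`, the choice of 8 confidence bins, and the simple additive residual correction (mean of observed bit minus predicted probability per bin) are all assumptions made for illustration. The abstract does not specify the exact correction rule.

```python
def tree_probs(p):
    """Decompose a 256-way distribution into 255 binary decisions.

    Returns q where q[n] = P(next bit = 1 | tree node n), for n in 1..255.
    Node 1 is the root; the children of node n are 2n (bit 0) and 2n+1 (bit 1).
    """
    mass = [0.0] * 512
    mass[256:512] = p                      # leaves hold the byte probabilities
    for n in range(255, 0, -1):            # accumulate subtree masses bottom-up
        mass[n] = mass[2 * n] + mass[2 * n + 1]
    return [0.0] + [mass[2 * n + 1] / mass[n] if mass[n] > 0 else 0.5
                    for n in range(1, 256)]


def byte_prob(q, byte):
    """Recompose P(byte) by walking the tree from the most significant bit."""
    node, prob = 1, 1.0
    for i in range(7, -1, -1):
        bit = (byte >> i) & 1
        prob *= q[node] if bit else 1.0 - q[node]
        node = 2 * node + bit
    return prob


class BinaryCalibrator:
    """One denoising step (assumed form): per (node, confidence bin),
    track the mean residual (bit - q) and add it back to the prediction."""

    def __init__(self, bins=8):
        self.bins = bins
        self.err = [[0.0] * bins for _ in range(256)]
        self.cnt = [[0] * bins for _ in range(256)]

    def _bin(self, q):
        return min(int(q * self.bins), self.bins - 1)

    def correct(self, node, q):
        b = self._bin(q)
        n = self.cnt[node][b]
        delta = self.err[node][b] / n if n else 0.0
        return min(max(q + delta, 1e-6), 1.0 - 1e-6)   # keep q codable

    def update(self, node, q, bit):
        b = self._bin(q)
        self.err[node][b] += bit - q
        self.cnt[node][b] += 1


def denoise(steps, node, q):
    """Apply successive steps; each refines the previous step's output."""
    inputs = []
    for step in steps:
        inputs.append(q)                   # remember what each step saw
        q = step.correct(node, q)
    return q, inputs


def learn(steps, node, inputs, bit):
    """After the bit is coded, update every step with its own input."""
    for step, q_in in zip(steps, inputs):
        step.update(node, q_in, bit)
```

In an online codec built along these lines, the encoder and decoder would both call `denoise` before coding each bit and `learn` afterwards, so the calibration tables stay synchronized without side information. Working per binary node means each table entry only needs to estimate one scalar residual, which is why few observations can suffice.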