Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation

Abstract

We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer to improve the probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When a context has been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively few observations. The denoising process is applied in multiple successive steps, so that each stage can refine the residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage, applied after all model predictions have been combined, so that it can correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
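The bitwise-tree decomposition described above can be illustrated with a minimal sketch. It is not Midicoth's implementation: the function names `bit_path_probs` and `path_product` are hypothetical, the bit order (most significant bit first) is an assumption, and the input is assumed to be a strictly positive 256-entry distribution, as a prior-smoothed model would produce. The sketch shows only that one 256-way prediction factors into eight binary predictions whose product recovers the original byte probability. It does not perform any calibration or denoising.

```python
import math

def bit_path_probs(byte_probs, symbol):
    """Decompose a 256-way prediction into 8 binary decisions for `symbol`.

    Returns a list of (bit, p_one) pairs, most significant bit first, where
    p_one is P(bit = 1 | higher bits already decided).
    Assumes every entry of byte_probs is strictly positive.
    """
    if len(byte_probs) != 256:
        raise ValueError("expected 256 probabilities")
    lo, hi = 0, 256  # the current tree node covers symbols in [lo, hi)
    path = []
    for k in range(7, -1, -1):
        mid = (lo + hi) // 2
        node_mass = sum(byte_probs[lo:hi])
        p_one = sum(byte_probs[mid:hi]) / node_mass  # mass of the upper half
        bit = (symbol >> k) & 1
        path.append((bit, p_one))
        # Descend into the half of the range selected by this bit.
        if bit:
            lo = mid
        else:
            hi = mid
    return path

def path_product(path):
    """Multiply the probabilities of the bits actually taken along the path."""
    prob = 1.0
    for bit, p_one in path:
        prob *= p_one if bit else 1.0 - p_one
    return prob

# Example: a non-uniform, strictly positive distribution over byte values.
weights = [i + 1 for i in range(256)]
total = sum(weights)
probs = [w / total for w in weights]

path = bit_path_probs(probs, 0x41)
# The eight binary probabilities multiply back to the byte probability.
assert math.isclose(path_product(path), probs[0x41])
```

Because each tree node is a binary prediction, any per-node correction only has to estimate a single number from the observed bits, which is why the abstract describes the decomposition as data-efficient.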
