LLMZip: Lossless Text Compression using Large Language Models
June 6, 2023
Authors: Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai
cs.AI
Abstract
We provide new estimates of an asymptotic upper bound on the entropy of
English using the large language model LLaMA-7B as a predictor for the next
token given a window of past tokens. This estimate is significantly smaller
than currently available estimates (Cover and King, 1978; Lutati et al., 2023).
A natural byproduct is an algorithm for lossless
compression of English text which combines the prediction from the large
language model with a lossless compression scheme. Preliminary results from
limited experiments suggest that our scheme outperforms state-of-the-art text
compression schemes such as BSC, ZPAQ, and paq8h.
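For intuition, a standard way to obtain such an upper bound is to average the model's surprisal over a long text: if q(x_i | x_{i-M}, ..., x_{i-1}) is the probability the language model assigns to the true next token given the previous M tokens, then (1/N) * sum_i -log2 q(x_i | ...) bits per token is an achievable compression rate and therefore an upper bound on the entropy rate. The sketch below illustrates one simple way to couple a next-symbol predictor with an off-the-shelf compressor: record the rank of each true symbol under the model's predicted ordering, then hand the rank stream to zlib, since a good predictor produces mostly small ranks. A tiny adaptive order-1 byte model stands in for LLaMA-7B so the example runs without any model weights; the model class, function names, and the choice of zlib are illustrative assumptions, not the paper's exact pipeline.

    # Hedged sketch: rank-based lossless compression driven by a next-symbol
    # predictor. A toy adaptive order-1 byte model stands in for LLaMA-7B
    # (assumption for illustration only); the predictor's job is just to
    # return a deterministic ranking of candidate next symbols.
    import zlib
    from collections import defaultdict

    class Order1Model:
        """Toy stand-in for an LLM: ranks candidate next bytes by the
        frequency observed so far, conditioned on the previous byte."""
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))

        def ranking(self, prev):
            # Most frequent continuations first; unseen bytes in fixed order.
            seen = sorted(self.counts[prev], key=lambda b: -self.counts[prev][b])
            rest = [b for b in range(256) if b not in self.counts[prev]]
            return seen + rest

        def update(self, prev, cur):
            self.counts[prev][cur] += 1

    def encode(data: bytes) -> bytes:
        model, prev, ranks = Order1Model(), 0, bytearray()
        for cur in data:
            ranks.append(model.ranking(prev).index(cur))  # rank of true symbol
            model.update(prev, cur)
            prev = cur
        return zlib.compress(bytes(ranks), 9)  # small ranks compress well

    def decode(blob: bytes) -> bytes:
        model, prev, out = Order1Model(), 0, bytearray()
        for r in zlib.decompress(blob):
            cur = model.ranking(prev)[r]  # same model state as the encoder
            out.append(cur)
            model.update(prev, cur)
            prev = cur
        return bytes(out)

    if __name__ == "__main__":
        text = b"the quick brown fox jumps over the lazy dog " * 20
        packed = encode(text)
        assert decode(packed) == text  # lossless round trip
        print(len(text), "->", len(packed), "bytes")

Replacing the toy predictor with LLaMA-7B's next-token distribution over a window of past tokens (or driving an arithmetic coder directly from the token probabilities instead of using zlib on ranks) is the direction the abstract describes, at the cost of much slower encoding and decoding.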