LLMZip: Lossless Text Compression using Large Language Models
June 6, 2023
Authors: Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai
cs.AI
Abstract
We provide new estimates of an asymptotic upper bound on the entropy of
English using the large language model LLaMA-7B as a predictor for the next
token given a window of past tokens. This estimate is significantly smaller
than currently available estimates in Cover and King (1978) and
Lutati et al. (2023). A natural byproduct is an algorithm for lossless
compression of English text which combines the prediction from the large
language model with a lossless compression scheme. Preliminary results from
limited experiments suggest that our scheme outperforms state-of-the-art text
compression schemes such as BSC, ZPAQ, and paq8h.
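
The quantity the abstract estimates is the empirical average of -log2 q(x_i | past-token window), where q is the language model's predictive distribution; this average upper-bounds the entropy of the source in bits per token. The sketch below (not the authors' code) shows how such an estimate can be computed with an off-the-shelf causal language model. The Hugging Face checkpoint name, the window size of 512 tokens, and the helper name bits_per_token are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: average -log2 probability a causal LM assigns to each next
# token, i.e., an empirical upper bound on entropy in bits per token.
# Checkpoint name, window size, and function name are assumptions for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_token(text, model_name="huggyllama/llama-7b", window=512):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tok(text, return_tensors="pt").input_ids[0]
    total_bits = 0.0
    with torch.no_grad():
        for i in range(1, len(ids)):
            context = ids[max(0, i - window):i].unsqueeze(0)  # window of past tokens
            logits = model(context).logits[0, -1]             # scores for the next token
            log_probs = torch.log_softmax(logits, dim=-1)
            total_bits += -log_probs[ids[i]].item() / math.log(2)  # nats -> bits
    return total_bits / (len(ids) - 1)  # average bits per token

# Example (small input; a full run over a corpus is needed for a meaningful estimate):
# print(bits_per_token("The quick brown fox jumps over the lazy dog."))
```

The compression scheme described in the abstract pairs this kind of next-token prediction with a lossless coder, so the compressed size per token approaches the bits-per-token figure computed above.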