Language Modeling Is Compression
September 19, 2023
Authors: Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness
cs.AI
Abstract
It has long been established that predictive models can be transformed into
lossless compressors and vice versa. Incidentally, in recent years, the machine
learning community has focused on training increasingly large and powerful
self-supervised (language) models. Since these large language models exhibit
impressive predictive capabilities, they are well-positioned to be strong
compressors. In this work, we advocate for viewing the prediction problem
through the lens of compression and evaluate the compression capabilities of
large (foundation) models. We show that large language models are powerful
general-purpose predictors and that the compression viewpoint provides novel
insights into scaling laws, tokenization, and in-context learning. For example,
Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to
43.4% and LibriSpeech samples to 16.4% of their raw size, beating
domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
Finally, we show that the prediction-compression equivalence allows us to use
any compressor (like gzip) to build a conditional generative model.
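
The final claim rests on the fact that a lossless compressor's code lengths can be read as negative log-probabilities, so any compressor implicitly defines a conditional distribution over continuations; conversely, a language model becomes a lossless compressor by arithmetic-coding each token with roughly -log2 p(token | context) bits, which is how compression rates like those quoted for Chinchilla 70B are measured. The Python sketch below is only an illustrative reading of the gzip direction, not the paper's implementation: it scores each candidate next byte by how many extra compressed bytes gzip needs after the context, and decodes greedily. The function names and the greedy decoding rule are assumptions made for the example.

import gzip

def next_byte_scores(context: bytes) -> dict[int, float]:
    """Read gzip's code lengths as an unnormalised next-byte distribution:
    a candidate that costs fewer extra compressed bytes is more probable."""
    base = len(gzip.compress(context))
    scores = {}
    for b in range(256):
        extra = len(gzip.compress(context + bytes([b]))) - base
        scores[b] = 2.0 ** (-8 * extra)  # code length in bits -> 2^(-bits)
    return scores

def generate(context: bytes, steps: int) -> bytes:
    """Greedy conditional generation driven only by the compressor."""
    out = bytearray(context)
    for _ in range(steps):
        scores = next_byte_scores(bytes(out))
        out.append(max(scores, key=scores.get))
    return bytes(out)

print(generate(b"abcabcabcabc", steps=6))

Because gzip emits whole bytes, many candidates tie and the samples are crude, as one would expect from such a weak predictive model; swapping in a stronger compressor (or a language model used as one) yields a correspondingly stronger conditional generative model.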