Language Modeling Is Compression

Abstract

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
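The two directions of the prediction-compression equivalence can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work, and every name in it (`ideal_code_length_bits`, `next_byte_by_gzip`, `generate`, the `alpha` smoothing parameter) is hypothetical. The first function assumes a toy adaptive byte-frequency model in place of a language model and sums -log2 p over the sequence, which is the code length an arithmetic coder driven by that model would approach. The second direction treats gzip as a predictor: a continuation that makes the compressed output shorter is taken to be more likely. Picking the shortest continuation greedily is a simplification chosen here for brevity.

```python
import gzip
import math
from collections import Counter


def ideal_code_length_bits(data: bytes, alpha: float = 1.0) -> float:
    """Predictor -> compressor: total -log2 p(byte | past) in bits.

    Uses an adaptive order-0 byte model with add-alpha smoothing as a
    stand-in predictor (an assumption for illustration only).
    """
    counts = Counter()
    total_bits = 0.0
    for i, b in enumerate(data):
        # Probability of byte b given the i bytes seen so far.
        p = (counts[b] + alpha) / (i + 256 * alpha)
        total_bits += -math.log2(p)
        counts[b] += 1
    return total_bits


def next_byte_by_gzip(context: bytes, alphabet: bytes) -> int:
    """Compressor -> predictor: choose the candidate byte whose
    appending gives the shortest gzip output (ties go to the first
    candidate in `alphabet`)."""
    return min(alphabet, key=lambda b: len(gzip.compress(context + bytes([b]))))


def generate(context: bytes, n: int, alphabet: bytes) -> bytes:
    """Greedily extend `context` by n bytes; returns only the new bytes."""
    out = context
    for _ in range(n):
        out += bytes([next_byte_by_gzip(out, alphabet)])
    return out[len(context):]


if __name__ == "__main__":
    text = b"abracadabra " * 20
    print("raw bits:", 8 * len(text))
    print("model bits:", round(ideal_code_length_bits(text)))
    print("gzip continuation:", generate(text, 12, b"abcdr "))
```

Because gzip output lengths are whole bytes, many candidates tie, so the continuation is coarse. The sketch is meant only to make the direction of the construction concrete, not to suggest that gzip is a competitive generative model.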
