Language Modeling Is Compression

Abstract

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
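As a rough illustration of the last point, the sketch below turns gzip into a toy conditional generative model over bytes. It is an assumption-laden example, not necessarily the exact procedure used in the paper: each candidate next byte is scored by the compressed length of the context with that byte appended, the lengths are converted to probabilities proportional to 2^(-length in bits), and a byte is sampled. The function names (`compressed_bits`, `compression_rate`, `next_byte_distribution`, `generate`) are hypothetical and chosen only for this example.

```python
import gzip
import random


def compressed_bits(data: bytes) -> int:
    """Size of the gzip-compressed data, in bits."""
    return 8 * len(gzip.compress(data))


def compression_rate(data: bytes) -> float:
    """Compressed size divided by raw size (lower is better)."""
    return len(gzip.compress(data)) / len(data)


def next_byte_distribution(context: bytes, alphabet: bytes) -> dict[int, float]:
    """Assign each candidate byte a probability proportional to 2^(-code length)."""
    lengths = {b: compressed_bits(context + bytes([b])) for b in alphabet}
    shortest = min(lengths.values())
    # Subtract the shortest length before exponentiating to avoid underflow.
    weights = {b: 2.0 ** -(n - shortest) for b, n in lengths.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def generate(context: bytes, n_bytes: int, alphabet: bytes, rng: random.Random) -> bytes:
    """Sample n_bytes one at a time, conditioning on everything seen so far."""
    out = bytearray(context)
    for _ in range(n_bytes):
        dist = next_byte_distribution(bytes(out), alphabet)
        symbols = list(dist)
        out.append(rng.choices(symbols, weights=[dist[s] for s in symbols])[0])
    return bytes(out[len(context):])


if __name__ == "__main__":
    print(compression_rate(b"ab" * 500))
    print(generate(b"abab" * 20, 8, b"ab", random.Random(0)))
```

Because gzip output lengths are whole bytes, candidate scores differ only in steps of 8 bits, so the resulting distribution is coarse. A compressor with finer-grained code lengths (for example, an arithmetic coder driven by a predictive model) would yield smoother probabilities.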
