Language Modeling Is Compression

Abstract

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
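The two directions of the prediction-compression equivalence can be illustrated with a minimal sketch. It is not the authors' implementation: the adaptive order-0 byte model, the function names `ideal_code_length_bits` and `gzip_next_byte`, and the greedy "shortest compressed length wins" rule are all assumptions chosen for brevity. The first function shows how a predictive model implies a code length (the sum of -log2 p over the sequence, which an arithmetic coder can approach); the second shows how a compressor can be turned into a crude conditional predictor.

```python
import gzip
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal entropy coder would need under a toy predictive model.

    Assumption: an adaptive order-0 byte model that predicts each byte from
    Laplace-smoothed counts of the bytes seen so far. Summing -log2 p over
    the sequence gives the code length an arithmetic coder can approach.
    """
    counts = Counter()
    total_bits = 0.0
    for i, byte in enumerate(data):
        p = (counts[byte] + 1) / (i + 256)  # smoothed predictive probability
        total_bits += -math.log2(p)
        counts[byte] += 1
    return total_bits


def gzip_next_byte(context: bytes, alphabet: bytes) -> int:
    """Greedy next-byte choice using gzip as the 'model' (hypothetical helper).

    For each candidate byte, compress context + candidate and keep the
    candidate with the shortest output. Ties go to the first candidate
    in `alphabet`, because gzip lengths are whole bytes and often equal.
    """
    def cost(candidate: int) -> int:
        return len(gzip.compress(context + bytes([candidate])))

    return min(alphabet, key=cost)


if __name__ == "__main__":
    sample = b"abracadabra" * 50
    print("raw bits:      ", 8 * len(sample))
    print("model bits:    ", round(ideal_code_length_bits(sample)))
    print("gzip next byte:", chr(gzip_next_byte(sample, b"abcdr")))
```

A sampling variant would convert the compressed lengths into probabilities (for example, proportional to 2 raised to the negative length) instead of taking the minimum; the greedy rule is used here only to keep the sketch short.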
