Training LLMs over Neurally Compressed Text

Abstract

In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.
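The segmentation idea behind Equal-Info Windows can be illustrated with a toy sketch. It is not the authors' implementation: the abstract pairs the technique with a neural compressor and Arithmetic Coding, whereas this sketch assumes a static unigram byte model and uses the ideal code length, -log2 p, as a stand-in for the bits a real coder would emit. The names `fit_unigram_bits`, `equal_info_windows`, and `budget` are hypothetical. The sketch only shows how text might be cut greedily into blocks that each fit the same bit budget; how a real coder brings every block to exactly the same length is not covered here.

```python
import math
from collections import Counter


def fit_unigram_bits(corpus: bytes) -> dict:
    """Map each byte value to its ideal code length in bits, -log2 p(byte).

    ASSUMPTION: a static unigram model replaces the neural compressor.
    Add-one smoothing over all 256 byte values keeps every cost finite.
    """
    counts = Counter(corpus)
    total = len(corpus) + 256
    return {b: -math.log2((counts.get(b, 0) + 1) / total) for b in range(256)}


def equal_info_windows(text: bytes, bits: dict, budget: float) -> list:
    """Greedily split `text` into windows whose ideal code length fits `budget` bits.

    ASSUMPTION: summed -log2 p stands in for the output size of an
    arithmetic coder; no actual bitstream is produced.
    """
    windows = []
    start = 0      # index where the current window begins
    used = 0.0     # bits consumed by the current window so far
    for i, b in enumerate(text):
        cost = bits[b]
        if cost > budget:
            raise ValueError("a single symbol exceeds the bit budget")
        if used + cost > budget:
            # Close the current window and start a fresh one at this byte.
            windows.append(text[start:i])
            start, used = i, 0.0
        used += cost
    if start < len(text):
        windows.append(text[start:])  # final, possibly under-filled window
    return windows


sample = b"the quick brown fox jumps over the lazy dog and then the dog naps"
bits = fit_unigram_bits(sample)
windows = equal_info_windows(sample, bits, budget=32.0)
for w in windows:
    print(w, round(sum(bits[b] for b in w), 2))
```

Each printed window carries at most 32 bits under the toy model, so windows made of frequent bytes span more characters than windows made of rare ones. That variable-length-text, fixed-information property is the point of the illustration.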
