Training LLMs over Neurally Compressed Text
April 4, 2024
Authors: Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
cs.AI
Abstract
In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.