

Parallel Scaling Law for Language Models

May 15, 2025
Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
cs.AI

Abstract

It is commonly believed that scaling language models requires a significant space or time cost, incurred by increasing the parameters (parameter scaling) or the output tokens (inference-time scaling). We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, called parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams behaves similarly to scaling the parameters by O(log P) while offering superior inference efficiency. For example, ParScale can require up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective on the role of computation in machine learning.
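
To make the mechanism described above concrete, here is a minimal sketch of the ParScale idea in PyTorch. It is illustrative only: the abstract does not specify the transformation or aggregation details, so the learned per-stream prefixes, the gating head, and the `ParScaleWrapper`/`backbone` names below are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ParScaleWrapper(nn.Module):
    """Sketch: P parallel streams over one shared backbone (hypothetical interface)."""

    def __init__(self, backbone: nn.Module, hidden_size: int,
                 num_streams: int = 4, prefix_len: int = 1):
        super().__init__()
        self.backbone = backbone            # shared backbone; parameters are reused across streams
        self.num_streams = num_streams      # P parallel streams
        # One learnable prefix per stream stands in for the "diverse, learnable transformation".
        self.prefixes = nn.Parameter(torch.randn(num_streams, prefix_len, hidden_size) * 0.02)
        # Small head that scores each stream's output for dynamic aggregation.
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        B, T, H = inputs_embeds.shape
        P = self.num_streams
        # Replicate the batch P times and prepend a distinct prefix to each copy.
        x = inputs_embeds.unsqueeze(0).expand(P, B, T, H)
        prefix = self.prefixes.unsqueeze(1).expand(P, B, -1, H)
        x = torch.cat([prefix, x], dim=2).reshape(P * B, -1, H)
        # A single batched forward pass through the shared backbone covers all P streams.
        hidden = self.backbone(x)                         # (P*B, prefix_len+T, H), by assumption
        hidden = hidden[:, -T:, :].reshape(P, B, T, H)
        # Dynamic aggregation: softmax over per-stream gate scores, then a weighted sum.
        weights = torch.softmax(self.gate(hidden).squeeze(-1), dim=0)   # (P, B, T)
        return (weights.unsqueeze(-1) * hidden).sum(dim=0)              # (B, T, H)
```

Under these assumptions, the extra cost is almost entirely compute (the P-fold batched forward pass), not parameters, which is what allows the memory and latency increase to stay far below that of parameter scaling for a comparable quality gain.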
