Parallel Scaling Law for Language Models
May 15, 2025
Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
cs.AI
Abstract
It is commonly believed that scaling language models must incur a significant space or time cost, by increasing the parameters (parameter scaling) or the output tokens (inference-time scaling). We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while showing superior inference efficiency. For example, ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallel-scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios and provides an alternative perspective on the role of computation in machine learning.
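The abstract's three-step recipe (transform the input P ways, run the forward passes in parallel, dynamically aggregate the P outputs) maps naturally onto a small wrapper module. Below is a minimal PyTorch sketch, not the paper's implementation: it assumes learned additive embedding offsets as the P input transformations and a learned softmax gate over the stream logits as the dynamic aggregation; `ParScaleWrapper`, `stream_offsets`, and `gate` are illustrative names, and the backbone is assumed to map embeddings of shape (batch, seq, d_model) to logits.

```python
import torch
import torch.nn as nn

class ParScaleWrapper(nn.Module):
    """Minimal sketch of parallel scaling around a shared backbone.

    Illustrative assumptions (not the paper's exact design):
      * the P input transformations are learned additive offsets on token embeddings;
      * the dynamic aggregation is a learned softmax gate over the P output logits.
    """

    def __init__(self, backbone: nn.Module, d_model: int, vocab_size: int, num_streams: int = 4):
        super().__init__()
        self.backbone = backbone  # shared parameters, reused by all P streams
        self.P = num_streams
        # One learnable transformation per stream (here: an additive embedding offset).
        self.stream_offsets = nn.Parameter(torch.zeros(num_streams, d_model))
        # Gate producing per-stream, per-token aggregation weights from the stream outputs.
        self.gate = nn.Linear(vocab_size, 1)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        B, T, D = token_embeddings.shape
        # 1) Apply the P diverse, learnable transformations to the same input.
        x = token_embeddings.unsqueeze(1) + self.stream_offsets.view(1, self.P, 1, D)
        # 2) Execute the P forward passes in parallel by folding streams into the batch.
        logits = self.backbone(x.reshape(B * self.P, T, D))   # (B*P, T, vocab)
        logits = logits.reshape(B, self.P, T, -1)
        # 3) Dynamically aggregate the P outputs with learned, input-dependent weights.
        weights = torch.softmax(self.gate(logits), dim=1)     # (B, P, T, 1)
        return (weights * logits).sum(dim=1)                  # (B, T, vocab)
```

Because all P streams reuse the same backbone weights, the extra memory is limited to the small per-stream offsets and the gate, while the extra compute can be batched over the shared parameters; this is the source of the inference-efficiency argument in the abstract.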
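To make the "P parallel streams ≈ scaling parameters by O(log P)" claim concrete, here is a toy calculation. It assumes, purely for illustration, a Chinchilla-style loss curve L(N) = E + A / N^α and models P streams as multiplying the effective parameter count by (1 + k log P); the constants E, A, α, and k are invented, and the paper's fitted functional form may differ.

```python
import math

# Toy illustration of "P parallel streams ~ multiplying parameters by Theta(log P)".
# The functional form and all constants below are assumptions, not the paper's fit.
E, A, alpha, k = 1.7, 4e2, 0.34, 0.5   # made-up irreducible loss, scale, exponent, stream gain

def loss(n_params: float, p_streams: int = 1) -> float:
    # Assumed effective-parameter boost from running p_streams parallel streams.
    effective = n_params * (1.0 + k * math.log(p_streams))
    return E + A / (effective ** alpha)

base = 1e9  # a 1B-parameter backbone
for p in (1, 2, 4, 8):
    equivalent = base * (1 + k * math.log(p)) / 1e9
    print(f"P={p}: loss ≈ {loss(base, p):.4f}, "
          f"acts like a {equivalent:.2f}B-parameter single-stream model")
```

Under these assumed constants, going from P=1 to P=8 behaves like roughly doubling the parameter count while keeping memory close to the 1B baseline, which mirrors the abstract's comparison against parameter scaling.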