Parallel Scaling Law for Language Models
May 15, 2025
Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
cs.AI
Abstract
It is commonly believed that scaling language models must incur a significant space or time cost, by increasing the parameters (parameter scaling) or the output tokens (inference-time scaling). We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while showing superior inference efficiency. For example, ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallel-scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios and provides an alternative perspective on the role of computation in machine learning.
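The abstract's three-step recipe (transform the input P ways, run the forward passes in parallel, dynamically aggregate the P outputs) maps naturally onto a small wrapper module. Below is a minimal PyTorch sketch, not the paper's implementation: it assumes learned additive embedding offsets as the P input transformations and a learned softmax gate over the stream logits as the dynamic aggregation; `ParScaleWrapper`, `stream_offsets`, and `gate` are illustrative names, and the backbone is assumed to map embeddings of shape (batch, seq, d_model) to logits.

```python
import torch
import torch.nn as nn

class ParScaleWrapper(nn.Module):
    """Minimal sketch of parallel scaling around a shared backbone.

    Illustrative assumptions (not the paper's exact design):
      * the P input transformations are learned additive offsets on token embeddings;
      * the dynamic aggregation is a learned softmax gate over the P output logits.
    """

    def __init__(self, backbone: nn.Module, d_model: int, vocab_size: int, num_streams: int = 4):
        super().__init__()
        self.backbone = backbone  # shared parameters, reused by all P streams
        self.P = num_streams
        # One learnable transformation per stream (here: an additive embedding offset).
        self.stream_offsets = nn.Parameter(torch.zeros(num_streams, d_model))
        # Gate producing per-stream, per-token aggregation weights from the stream outputs.
        self.gate = nn.Linear(vocab_size, 1)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        B, T, D = token_embeddings.shape
        # 1) Apply the P diverse, learnable transformations to the same input.
        x = token_embeddings.unsqueeze(1) + self.stream_offsets.view(1, self.P, 1, D)
        # 2) Execute the P forward passes in parallel by folding streams into the batch.
        logits = self.backbone(x.reshape(B * self.P, T, D))   # (B*P, T, vocab)
        logits = logits.reshape(B, self.P, T, -1)
        # 3) Dynamically aggregate the P outputs with learned, input-dependent weights.
        weights = torch.softmax(self.gate(logits), dim=1)     # (B, P, T, 1)
        return (weights * logits).sum(dim=1)                  # (B, T, vocab)
```

Because all P streams reuse the same backbone weights, the extra memory is limited to the small per-stream offsets and the gate, while the extra compute can be batched over the shared parameters; this is the source of the inference-efficiency argument in the abstract.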
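To make the "P parallel streams ≈ scaling parameters by O(log P)" claim concrete, here is a toy calculation. It assumes, purely for illustration, a Chinchilla-style loss curve L(N) = E + A / N^α and models P streams as multiplying the effective parameter count by (1 + k log P); the constants E, A, α, and k are invented, and the paper's fitted functional form may differ.

```python
import math

# Toy illustration of "P parallel streams ~ multiplying parameters by Theta(log P)".
# The functional form and all constants below are assumptions, not the paper's fit.
E, A, alpha, k = 1.7, 4e2, 0.34, 0.5   # made-up irreducible loss, scale, exponent, stream gain

def loss(n_params: float, p_streams: int = 1) -> float:
    # Assumed effective-parameter boost from running p_streams parallel streams.
    effective = n_params * (1.0 + k * math.log(p_streams))
    return E + A / (effective ** alpha)

base = 1e9  # a 1B-parameter backbone
for p in (1, 2, 4, 8):
    equivalent = base * (1 + k * math.log(p)) / 1e9
    print(f"P={p}: loss ≈ {loss(base, p):.4f}, "
          f"acts like a {equivalent:.2f}B-parameter single-stream model")
```

Under these assumed constants, going from P=1 to P=8 behaves like roughly doubling the parameter count while keeping memory close to the 1B baseline, which mirrors the abstract's comparison against parameter scaling.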