PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
December 27, 2023
Authors: Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu, Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao
cs.AI
Abstract
The recent trend in large language models (LLMs) is to increase the scale of both the model (i.e., the number of parameters) and the dataset to achieve better generative ability, as convincingly demonstrated by works such as the well-known GPT and Llama. However, large models often incur massive computational costs that practical applications cannot afford, while the question of how to construct a strong model architecture for LLMs is rarely discussed. We first analyze state-of-the-art language model architectures and observe the feature collapse problem. Based on a theoretical analysis, we propose that nonlinearity, which is usually studied in convolutional neural networks for vision tasks, is also very important for language models. We then introduce the series-informed activation function, whose added computation is negligible, and further use an augmented shortcut to enhance the model's nonlinearity. Through carefully designed ablations, we demonstrate that the proposed approach is significantly effective for enhancing model nonlinearity; on this basis, we present a new efficient architecture for modern LLMs, namely PanGu-π. Experiments are then conducted with the same dataset and training strategy to compare PanGu-π with state-of-the-art LLMs. The results show that PanGu-π-7B achieves performance comparable to that of the benchmark models with about a 10% inference speed-up, and PanGu-π-1B achieves state-of-the-art performance in terms of both accuracy and efficiency. In addition, we have deployed PanGu-π-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical applications. The results show that YunShan surpasses other models of similar scale on benchmarks.
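
To make the two ideas named in the abstract concrete, below is a minimal PyTorch sketch of a series-informed activation and an augmented shortcut inside an attention block. This is an illustration based only on the abstract's description, not the paper's exact implementation: the class names, the parameterization sigma_s(x) = sum_i a_i * sigma(x + b_i), the choice of GELU as the base activation, the number of series terms, and the linear form of the augmented-shortcut projection are all assumptions.

    # Illustrative sketch of nonlinearity compensation; names and
    # parameterizations are assumptions, not the paper's exact design.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SeriesInformedActivation(nn.Module):
        """Sum of several scaled-and-shifted copies of a base activation.

        Assumed form: sigma_s(x) = sum_i a_i * sigma(x + b_i), adding only a
        handful of learnable scalars per layer, which matches the abstract's
        claim that the extra computation is negligible.
        """

        def __init__(self, num_terms: int = 3):
            super().__init__()
            # Learnable scales a_i and shifts b_i (illustrative initialization).
            self.scales = nn.Parameter(torch.ones(num_terms))
            self.shifts = nn.Parameter(torch.linspace(-1.0, 1.0, num_terms))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return sum(a * F.gelu(x + b) for a, b in zip(self.scales, self.shifts))


    class AugmentedShortcutBlock(nn.Module):
        """Attention block with an extra parameterized shortcut.

        Alongside the identity skip connection, a cheap nonlinear path
        T(x) = act(x W) runs in parallel with self-attention, so the block
        output is attn(x) + x + T(x) (assumed composition).
        """

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # The augmented shortcut: a linear projection followed by the
            # series-informed activation, adding nonlinearity to the skip path.
            self.aug_proj = nn.Linear(dim, dim)
            self.aug_act = SeriesInformedActivation()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            return attn_out + x + self.aug_act(self.aug_proj(h))


    # Quick shape check on random data.
    if __name__ == "__main__":
        block = AugmentedShortcutBlock(dim=64)
        tokens = torch.randn(2, 16, 64)   # (batch, sequence, hidden)
        print(block(tokens).shape)        # torch.Size([2, 16, 64])

The design intuition, as stated in the abstract, is that both components inject additional nonlinearity at near-zero cost: the series activation enriches each pointwise nonlinearity, and the augmented shortcut prevents the skip path from being purely linear, which is what makes features prone to collapse.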