PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
December 27, 2023
Authors: Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu, Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao
cs.AI
Abstract
The recent trend in large language models (LLMs) is to increase the scale of both the model (i.e., the number of parameters) and the dataset to achieve better generative ability, as convincingly demonstrated by works such as the well-known GPT and Llama. However, large models often incur massive computational costs that practical applications cannot afford, while the question of how to construct a strong model architecture for LLMs is rarely discussed. We first analyze state-of-the-art language model architectures and observe the feature collapse problem. Based on a theoretical analysis, we propose that nonlinearity, which is usually studied in convolutional neural networks for vision tasks, is also very important for language models. We then introduce the series-informed activation function, whose added computation is negligible, and further use an augmented shortcut to enhance the model's nonlinearity. Through carefully designed ablations, we demonstrate that the proposed approach is significantly effective for enhancing model nonlinearity; on this basis, we present a new efficient architecture for modern LLMs, namely PanGu-π. Experiments are then conducted with the same dataset and training strategy to compare PanGu-π with state-of-the-art LLMs. The results show that PanGu-π-7B achieves performance comparable to that of the benchmark models with about a 10% inference speed-up, and PanGu-π-1B achieves state-of-the-art performance in terms of both accuracy and efficiency. In addition, we have deployed PanGu-π-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical applications. The results show that YunShan surpasses other models of similar scale on benchmarks.
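
To make the two ideas named in the abstract concrete, below is a minimal PyTorch sketch of a series-informed activation and an augmented shortcut inside an attention block. This is an illustration based only on the abstract's description, not the paper's exact implementation: the class names, the parameterization sigma_s(x) = sum_i a_i * sigma(x + b_i), the choice of GELU as the base activation, the number of series terms, and the linear form of the augmented-shortcut projection are all assumptions.

    # Illustrative sketch of nonlinearity compensation; names and
    # parameterizations are assumptions, not the paper's exact design.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SeriesInformedActivation(nn.Module):
        """Sum of several scaled-and-shifted copies of a base activation.

        Assumed form: sigma_s(x) = sum_i a_i * sigma(x + b_i), adding only a
        handful of learnable scalars per layer, which matches the abstract's
        claim that the extra computation is negligible.
        """

        def __init__(self, num_terms: int = 3):
            super().__init__()
            # Learnable scales a_i and shifts b_i (illustrative initialization).
            self.scales = nn.Parameter(torch.ones(num_terms))
            self.shifts = nn.Parameter(torch.linspace(-1.0, 1.0, num_terms))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return sum(a * F.gelu(x + b) for a, b in zip(self.scales, self.shifts))


    class AugmentedShortcutBlock(nn.Module):
        """Attention block with an extra parameterized shortcut.

        Alongside the identity skip connection, a cheap nonlinear path
        T(x) = act(x W) runs in parallel with self-attention, so the block
        output is attn(x) + x + T(x) (assumed composition).
        """

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # The augmented shortcut: a linear projection followed by the
            # series-informed activation, adding nonlinearity to the skip path.
            self.aug_proj = nn.Linear(dim, dim)
            self.aug_act = SeriesInformedActivation()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            return attn_out + x + self.aug_act(self.aug_proj(h))


    # Quick shape check on random data.
    if __name__ == "__main__":
        block = AugmentedShortcutBlock(dim=64)
        tokens = torch.randn(2, 16, 64)   # (batch, sequence, hidden)
        print(block(tokens).shape)        # torch.Size([2, 16, 64])

The design intuition, as stated in the abstract, is that both components inject additional nonlinearity at near-zero cost: the series activation enriches each pointwise nonlinearity, and the augmented shortcut prevents the skip path from being purely linear, which is what makes features prone to collapse.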