

Scaling TransNormer to 175 Billion Parameters

July 27, 2023
作者: Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo, Yu Qiao, Yiran Zhong
cs.AI

Abstract

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through advanced modifications that include positional embedding, linear attention acceleration, a gating mechanism, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that accelerates linear attention runtime by more than 2x and reduces memory usage by a factor of four. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an overall speedup of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed regardless of sequence length, showcasing superior efficiency during both training and inference. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even larger models, all while maintaining strong performance metrics. Rigorous validation of our model design is achieved through a series of comprehensive experiments on our self-collected corpus, exceeding 6TB in size and containing over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter our collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
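To make the central mechanism concrete, below is a minimal sketch of causal linear attention with an exponential decay. It is not the paper's LRPE or Lightning Attention implementation (which involve a specific positional encoding and IO-aware kernels); it only illustrates, under assumed shapes, decay value, and function names, why a recurrent state yields constant per-token cost at inference regardless of sequence length, and how that recurrence matches the parallel training-time formulation.

```python
# Hypothetical sketch of decayed causal linear attention; not the official TransNormerLLM code.
import torch

def linear_attention_recurrent(q, k, v, decay):
    """Recurrent (inference-style) form: one fixed-size state per step.

    q, k, v: (seq_len, d) tensors; decay: float in (0, 1).
    """
    seq_len, d = q.shape
    state = torch.zeros(d, d)                # running sum of decayed k_s v_s^T outer products
    outputs = []
    for t in range(seq_len):
        state = decay * state + torch.outer(k[t], v[t])
        outputs.append(q[t] @ state)         # o_t = sum_{s<=t} decay^(t-s) (q_t . k_s) v_s
    return torch.stack(outputs)

def linear_attention_parallel(q, k, v, decay):
    """Parallel (training-style) form with a lower-triangular decay mask."""
    seq_len = q.shape[0]
    idx = torch.arange(seq_len)
    rel = (idx[:, None] - idx[None, :]).clamp(min=0).float()
    mask = torch.tril(decay ** rel)          # decay^(t-s) for s <= t, zero above the diagonal
    return ((q @ k.T) * mask) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(16, 8) for _ in range(3))
    rec = linear_attention_recurrent(q, k, v, decay=0.9)
    par = linear_attention_parallel(q, k, v, decay=0.9)
    print(torch.allclose(rec, par, atol=1e-5))  # both forms produce the same outputs
```

In the recurrent form, memory and per-token compute do not grow with sequence length, which is the property behind the abstract's claim of consistent inference speed; the paper's contributions (LRPE with decay, Lightning Attention, gating, and tensor normalization) refine and accelerate this basic mechanism.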