
Scaling TransNormer to 175 Billion Parameters

July 27, 2023
Authors: Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo, Yu Qiao, Yiran Zhong
cs.AI

Abstract

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through advanced modifications that include positional embedding, linear attention acceleration, a gating mechanism, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that more than doubles the runtime speed of linear attention and reduces its memory usage by a factor of four. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, yielding a speedup of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed regardless of sequence length, showcasing superior efficiency during both training and inference. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even larger models while maintaining outstanding performance metrics. We rigorously validate our model design through a series of comprehensive experiments on our self-collected corpus, which exceeds 6TB in size and contains over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter the collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
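The efficiency claims above rest on causal linear attention combined with a per-head exponential decay, which replaces the quadratic softmax attention matrix with a fixed-size recurrent state. The sketch below illustrates that recurrence in PyTorch; the tensor shapes, the (elu + 1) feature map, the decay schedule, and the function name are illustrative assumptions rather than the authors' released implementation, which additionally applies LRPE and computes the same quantity with the block-wise Lightning Attention kernel.

```python
# Minimal sketch (not the authors' code) of causal linear attention with
# per-head exponential decay, the mechanism the abstract pairs with LRPE
# to avoid attention dilution while keeping global token interactions.
import torch


def linear_attention_with_decay(q, k, v, decay):
    """
    q, k: (batch, heads, seq_len, d_k)    v: (batch, heads, seq_len, d_v)
    decay: (heads,) per-head decay rate lambda in (0, 1) -- assumed schedule.

    Computes o_t = q_t @ s_t with the recurrence
        s_t = lambda * s_{t-1} + k_t^T v_t,
    i.e. causal linear attention whose influence falls off exponentially
    with token distance. Cost is O(n * d_k * d_v): linear in sequence length.
    """
    b, h, n, d_k = q.shape
    d_v = v.shape[-1]
    # Non-negative feature map keeps kernelized attention weights positive
    # (an illustrative choice; the paper uses its own normalization scheme).
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    s = torch.zeros(b, h, d_k, d_v, device=q.device, dtype=q.dtype)
    out = torch.empty(b, h, n, d_v, device=q.device, dtype=q.dtype)
    lam = decay.view(1, h, 1, 1)
    for t in range(n):
        # Update the (d_k x d_v) state with the current key/value outer product.
        s = lam * s + k[:, :, t, :, None] * v[:, :, t, None, :]
        # Read out the current token by contracting the query with the state.
        out[:, :, t] = (q[:, :, t, :, None] * s).sum(dim=-2)
    return out


if __name__ == "__main__":
    # Tiny smoke test with hypothetical sizes.
    q = torch.randn(2, 8, 128, 64)
    k = torch.randn(2, 8, 128, 64)
    v = torch.randn(2, 8, 128, 64)
    decay = torch.linspace(0.99, 0.90, 8)      # one decay rate per head
    out = linear_attention_with_decay(q, k, v, decay)
    print(out.shape)                            # torch.Size([2, 8, 128, 64])
```

Because the recurrent state has a fixed d_k x d_v size per head, the per-token compute and memory do not grow with sequence length, which is consistent with the abstract's claim of constant inference speed regardless of length; Lightning Attention reorganizes the same computation into IO-aware blocks to obtain the reported training-time speedups.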