BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
September 20, 2023
Authors: Nolan Dey, Daria Soboleva, Faisal Al-Khateeb, Bowen Yang, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming Chen, Robert Myers, Jacob Robert Steeves, Natalia Vassilieva, Marvin Tom, Joel Hestness
cs.AI
Abstract
We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new
state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was
trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and
8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models
by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B
parameter models. Additionally, BTLM-3B-8K provides excellent long context
performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192
context length. We trained the model on a cleaned and deduplicated SlimPajama
dataset; aggressively tuned the μP hyperparameters and schedule; used
ALiBi position embeddings; and adopted the SwiGLU nonlinearity.
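
For concreteness, here is a minimal PyTorch sketch of the two architectural components named above. The module names, hidden width, and bias-free projections are illustrative assumptions rather than BTLM's exact configuration; the SwiGLU formulation follows Shazeer (2020) and the ALiBi slope scheme follows Press et al. (2022).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU nonlinearity:
    W_down(SiLU(x W_gate) * (x W_up))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi: a fixed linear penalty added to attention logits in place of
    position embeddings. Assumes num_heads is a power of two, as in the
    original ALiBi recipe."""
    # Per-head slopes form the geometric sequence 2^(-8/h), 2^(-16/h), ...
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]
    )
    pos = torch.arange(seq_len)
    # key_pos - query_pos is <= 0 for causal positions, so distant keys are
    # penalized linearly; the causal mask itself is applied separately.
    dist = (pos[None, :] - pos[:, None]).float()     # (seq_len, seq_len)
    return slopes[:, None, None] * dist[None, :, :]  # (heads, seq_len, seq_len)
```

Because ALiBi encodes position as a distance penalty rather than a learned embedding table, a model trained largely at 2,048-token context can still be run at 8,192 tokens, which is what the mixed-context training recipe above exploits.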
On Hugging Face, the most popular models have 7B parameters, indicating that
users prefer the quality-size ratio of 7B models. Compacting the 7B parameter
model to one with 3B parameters, with little performance impact, is an
important milestone. BTLM-3B-8K needs only 3GB of memory with 4-bit precision
and takes 2.5x less inference compute than 7B models, helping to open up access
to a powerful language model on mobile and edge devices. BTLM-3B-8K is
available under an Apache 2.0 license on Hugging Face:
https://huggingface.co/cerebras/btlm-3b-8k-base.
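
As a usage illustration, the following sketch loads the released checkpoint with Hugging Face transformers. The 4-bit path assumes the optional bitsandbytes and accelerate packages and a CUDA device; the prompt and generation settings are arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weight quantization via bitsandbytes (requires a CUDA device);
# drop quantization_config to load in full precision instead.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("cerebras/btlm-3b-8k-base")
model = AutoModelForCausalLM.from_pretrained(
    "cerebras/btlm-3b-8k-base",
    quantization_config=quant_config,
    device_map="auto",       # placement handled by the accelerate package
    trust_remote_code=True,  # the repo ships custom modeling code
)

prompt = "The Bittensor Language Model is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```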