BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

September 20, 2023
Authors: Nolan Dey, Daria Soboleva, Faisal Al-Khateeb, Bowen Yang, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming Chen, Robert Myers, Jacob Robert Steeves, Natalia Vassilieva, Marvin Tom, Joel Hestness
cs.AI

Abstract

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context length. We trained the model on a cleaned and deduplicated SlimPajama dataset; aggressively tuned the μP hyperparameters and schedule; used ALiBi position embeddings; and adopted the SwiGLU nonlinearity. On Hugging Face, the most popular models have 7B parameters, indicating that users prefer the quality-size ratio of 7B models. Compacting the 7B parameter model to one with 3B parameters, with little performance impact, is an important milestone. BTLM-3B-8K needs only 3GB of memory with 4-bit precision and takes 2.5x less inference compute than 7B models, helping to open up access to a powerful language model on mobile and edge devices. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
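The abstract credits part of BTLM-3B-8K's long-context ability to ALiBi position embeddings, which replace learned positional vectors with a fixed, head-specific linear penalty added to attention scores, letting the model extrapolate to contexts longer than those seen in training. Below is a minimal PyTorch sketch of the general ALiBi bias (following Press et al.'s slope recipe for power-of-two head counts), not Cerebras' exact implementation:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear attention biases as in ALiBi (Press et al., 2022)."""
    # Geometric slopes 2^(-8/n), 2^(-16/n), ..., one per attention head.
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    # rel[i, j] = j - i: zero on the diagonal, increasingly negative for older keys;
    # clamp zeroes out future positions, which the causal mask handles anyway.
    rel = (pos[None, :] - pos[:, None]).clamp(max=0)
    # Shape (n_heads, seq_len, seq_len); add to attention logits before softmax.
    return slopes[:, None, None] * rel[None, :, :]
```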
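The SwiGLU nonlinearity the model adopts is a gated feed-forward variant (Shazeer, 2020): the hidden activation is the SiLU ("swish") of one linear projection, gated elementwise by a second projection, then projected back to the model dimension. A minimal sketch assuming standard PyTorch; the layer names here are illustrative, not BTLM's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x @ W_gate) * (x @ W_up), then W_down."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate branch passes through SiLU; up branch is multiplied in elementwise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```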
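The 3GB figure for 4-bit inference is consistent with back-of-the-envelope arithmetic: 3B parameters at 0.5 bytes each is roughly 1.5GB of weights, with the remainder going to activations, the KV cache, and dequantization buffers. Below is a hedged sketch of loading the released checkpoint in 4-bit via Hugging Face transformers with bitsandbytes; exact flags vary across library versions, and trust_remote_code is needed because the Hub repo ships a custom model class:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~4 bits/weight
    device_map="auto",
    trust_remote_code=True,  # BTLM uses a custom architecture on the Hub
)

inputs = tokenizer("The Bittensor Language Model is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```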