Zamba: A Compact 7B SSM Hybrid Model
May 26, 2024
Authors: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge
cs.AI
Abstract
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid
model which achieves competitive performance against leading open-weight models
at a comparable scale. Zamba is trained on 1T tokens from openly available
datasets and is the best non-transformer model at this scale. Zamba pioneers a
unique architecture combining a Mamba backbone with a single shared attention
module, thus obtaining the benefits of attention at minimal parameter cost. Due
to its architecture, Zamba is significantly faster at inference than comparable
transformer models and requires substantially less memory for generation of
long sequences. Zamba is pretrained in two phases: the first phase is based on
existing web datasets, while the second one consists of annealing the model
over high-quality instruct and synthetic datasets, and is characterized by a
rapid learning rate decay. We open-source the weights and all checkpoints for
Zamba, from both phase 1 and the annealing phase.
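The abstract's central architectural idea is a Mamba (SSM) backbone interleaved with a single attention module whose weights are reused at several depths, so the model gains attention's benefits at little extra parameter cost. The snippet below is a minimal sketch of that weight-sharing pattern only; the hidden size, head count, interleaving interval, and the placeholder MLP standing in for a real Mamba block are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of a "single shared attention module" inside an SSM-style stack.
# All sizes and the MambaLikeBlock placeholder are hypothetical; a real
# implementation would use the selective state-space (Mamba) recurrence.
import torch
import torch.nn as nn


class SharedAttentionBlock(nn.Module):
    """One attention module whose parameters are reused at several depths."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class MambaLikeBlock(nn.Module):
    """Placeholder for a Mamba/SSM block (here just a gated MLP residual)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class HybridBackbone(nn.Module):
    """Mamba-style blocks with one shared attention module applied periodically."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_blocks: int = 12, attn_every: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(MambaLikeBlock(d_model) for _ in range(n_blocks))
        # A single attention module: the same weights are invoked at every
        # depth where attention is applied, keeping the parameter cost small.
        self.shared_attn = SharedAttentionBlock(d_model, n_heads)
        self.attn_every = attn_every

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            if i % self.attn_every == 0:
                x = self.shared_attn(x)  # reused weights each time
            x = block(x)
        return x


if __name__ == "__main__":
    # Toy usage: (batch, sequence, d_model) input through the hybrid stack.
    x = torch.randn(2, 16, 512)
    print(HybridBackbone()(x).shape)  # torch.Size([2, 16, 512])
```

The design choice being illustrated is parameter sharing: the attention module is constructed once and called repeatedly, so adding attention at multiple depths does not multiply the attention parameter count.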