
Zamba: A Compact 7B SSM Hybrid Model

May 26, 2024
作者: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge
cs.AI

Abstract

In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second one consists of annealing the model over high-quality instruct and synthetic datasets, and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, through both phase 1 and annealing phases.
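The shared-attention idea can be sketched as follows: a stack of Mamba-style blocks is interleaved with a single attention module whose weights are reused at every application, so the benefits of attention come at roughly one block's worth of extra parameters. The PyTorch sketch below is a minimal illustration under that reading of the abstract; the block stub, layer counts, and interleaving interval are assumptions, not the actual Zamba implementation (which uses true Mamba SSM blocks rather than the stand-in used here).

```python
# Minimal sketch of a Mamba-backbone-plus-shared-attention layout, as described in the
# abstract. All module names, sizes, and the interleaving pattern are illustrative
# assumptions, not the actual Zamba implementation.
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Stand-in for a Mamba SSM block (a simple gated MLP, purely for illustration)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(h * torch.sigmoid(gate))


class SharedAttentionHybrid(nn.Module):
    """Stack of Mamba-style blocks with one attention module whose weights are reused
    every `attn_every` layers, so attention adds only a single block's parameters."""

    def __init__(self, d_model: int = 512, n_layers: int = 12,
                 n_heads: int = 8, attn_every: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(MambaBlockStub(d_model) for _ in range(n_layers))
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn_every = attn_every

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            x = block(x)
            if (i + 1) % self.attn_every == 0:
                h = self.attn_norm(x)
                # The same attention weights are applied at every interleaving point.
                attn_out, _ = self.shared_attn(h, h, h, need_weights=False)
                x = x + attn_out
        return x


if __name__ == "__main__":
    model = SharedAttentionHybrid()
    tokens = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
    print(model(tokens).shape)        # torch.Size([2, 16, 512])
```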
