Jamba: A Hybrid Transformer-Mamba Language Model
March 28, 2024
Authors: Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham
cs.AI
Abstract
We present Jamba, a new base large language model based on a novel hybrid
Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba
interleaves blocks of Transformer and Mamba layers, enjoying the benefits of
both model families. MoE is added in some of these layers to increase model
capacity while keeping active parameter usage manageable. This flexible
architecture allows resource- and objective-specific configurations. In the
particular configuration we have implemented, we end up with a powerful model
that fits on a single 80GB GPU. Built at large scale, Jamba provides higher
throughput and a smaller memory footprint than vanilla Transformers, while at
the same time achieving state-of-the-art performance on standard language
model benchmarks and long-context evaluations. Remarkably, the model presents
strong results for context lengths of up to 256K tokens. We study various
architectural decisions, such as how to combine Transformer and Mamba layers
and how to mix experts, and show that some of them are crucial in large-scale
modeling. We
also describe several interesting properties of these architectures which the
training and evaluation of Jamba have revealed, and plan to release checkpoints
from various ablation runs, to encourage further exploration of this novel
architecture. We make the weights of our implementation of Jamba publicly
available under a permissive license.
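To make the hybrid layout concrete, below is a minimal, self-contained Python sketch of how such a stack can be arranged: mostly Mamba layers interleaved with occasional attention layers, with the feed-forward sub-layer of some layers replaced by an MoE module. The specific ratios used here (one attention layer per eight layers, MoE in every second layer), the defaults, and the function names are illustrative assumptions for this sketch, not Jamba's released code or confirmed configuration.

```python
# Illustrative sketch only: builds the layer-type layout of a hybrid
# Transformer-Mamba stack with MoE in some layers. The ratios and names
# below are assumptions chosen for illustration, not Jamba's actual
# hyperparameters or implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class LayerSpec:
    mixer: str  # sequence mixer: "attention" or "mamba"
    ffn: str    # feed-forward sub-layer: "moe" or "mlp"


def build_hybrid_stack(n_layers: int = 32,
                       attention_every: int = 8,
                       moe_every: int = 2) -> List[LayerSpec]:
    """Interleave Mamba and attention mixers and place MoE FFNs periodically."""
    stack = []
    for i in range(n_layers):
        mixer = "attention" if (i + 1) % attention_every == 0 else "mamba"
        ffn = "moe" if (i + 1) % moe_every == 0 else "mlp"
        stack.append(LayerSpec(mixer=mixer, ffn=ffn))
    return stack


if __name__ == "__main__":
    for idx, spec in enumerate(build_hybrid_stack()):
        print(f"layer {idx:02d}: {spec.mixer:9s} + {spec.ffn}")
```

Because only the attention layers maintain a key-value cache at inference time, keeping attention layers sparse in a layout like this is what drives the smaller memory footprint and higher long-context throughput described in the abstract, while the MoE feed-forward layers add total parameter capacity without increasing the number of parameters active per token.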