Jamba: A Hybrid Transformer-Mamba Language Model

March 28, 2024
Authors: Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham
cs.AI

Abstract

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
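The abstract's core architectural idea (interleaving attention and Mamba layers, and attaching a mixture-of-experts feed-forward to only some of them) can be sketched in a few lines. The PyTorch code below is a minimal, hypothetical illustration, not the authors' released implementation: the Mamba layer is replaced by a simple gated-MLP stand-in so the example runs, and the layer ratio, expert count, and top-k routing values are placeholders rather than Jamba's actual configuration.

```python
# Hypothetical sketch of a hybrid attention/Mamba stack with MoE on some layers.
# All hyperparameters and the Mamba stand-in are illustrative, not Jamba's config.
import torch
import torch.nn as nn


class AttentionLayer(nn.Module):
    """Standard pre-norm Transformer self-attention sublayer with residual."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class MambaLayerStub(nn.Module):
    """Stand-in for a Mamba (selective state-space) layer; a gated MLP is used
    here only so the sketch runs. A real model would use an actual Mamba block."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(h * torch.sigmoid(gate))


class MoEFeedForward(nn.Module):
    """Top-k mixture-of-experts MLP: each token is routed to k experts, so total
    capacity grows with the expert count while active parameters stay bounded.
    (For clarity, this sketch evaluates every expert densely and masks the output.)"""
    def __init__(self, d_model: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):
        scores = self.router(x)                          # (batch, seq, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)       # per-token top-k routing
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return x + out


class HybridBlockStack(nn.Module):
    """Interleaves attention and Mamba-style layers (illustrative 1:3 ratio) and
    attaches an MoE feed-forward to every other layer; non-MoE layers would carry
    a plain MLP in a real model, omitted here to keep the sketch short."""
    def __init__(self, d_model: int, n_layers: int = 8):
        super().__init__()
        layers = []
        for i in range(n_layers):
            mixer = AttentionLayer(d_model) if i % 4 == 0 else MambaLayerStub(d_model)
            ffn = MoEFeedForward(d_model) if i % 2 == 1 else nn.Identity()
            layers.append(nn.ModuleList([mixer, ffn]))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for mixer, ffn in self.layers:
            x = ffn(mixer(x))
        return x


if __name__ == "__main__":
    model = HybridBlockStack(d_model=64)
    tokens = torch.randn(2, 16, 64)       # (batch, sequence, hidden)
    print(model(tokens).shape)            # torch.Size([2, 16, 64])
```

The routing step is what the abstract refers to as keeping "active parameter usage manageable": only the k selected experts contribute to each token's output, so adding experts increases total capacity without a proportional increase in per-token compute.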
