Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
August 22, 2024
Authors: Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden M Gerber, Elad Dolev, Eran Krakovsky, Erez Safahi, Erez Schwartz, Gal Cohen, Gal Shachaf, Haim Rozenblum, Hofit Bata, Ido Blass, Inbal Magar, Itay Dalmedigos, Jhonathan Osin, Julie Fadlon, Maria Rozman, Matan Danos, Michael Gokhman, Mor Zusman, Naama Gidron, Nir Ratner, Noam Gat, Noam Rozen, Oded Fried, Ohad Leshno, Omer Antverg, Omri Abend, Opher Lieber, Or Dagan, Orit Cohavi, Raz Alon, Ro'i Belson, Roi Cohen, Rom Gilad, Roman Glozman, Shahar Lev, Shaked Meirom, Tal Delbari, Tal Ness, Tomer Asida, Tom Ben Gal, Tom Braude, Uriya Pumerantz, Yehoshua Cohen, Yonatan Belinkov, Yuval Globerson, Yuval Peleg Levy, Yoav Shoham
cs.AI
Abstract
We present Jamba-1.5, new instruction-tuned large language models based on
our Jamba architecture. Jamba is a hybrid Transformer-Mamba mixture of experts
architecture, providing high throughput and low memory usage across context
lengths, while retaining the same or better quality as Transformer models. We
release two model sizes: Jamba-1.5-Large, with 94B active parameters, and
Jamba-1.5-Mini, with 12B active parameters. Both models are fine-tuned for a
variety of conversational and instruction-following capabilities, and have an
effective context length of 256K tokens, the largest amongst open-weight
models. To support cost-effective inference, we introduce ExpertsInt8, a novel
quantization technique that allows fitting Jamba-1.5-Large on a machine with 8
80GB GPUs when processing 256K-token contexts without loss of quality. When
evaluated on a battery of academic and chatbot benchmarks, Jamba-1.5 models
achieve excellent results while providing high throughput and outperforming
other open-weight models on long-context benchmarks. The model weights for both
sizes are publicly available under the Jamba Open Model License and we release
ExpertsInt8 as open source.
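
For intuition about the memory savings behind weight-only INT8 quantization of expert weights, here is a minimal PyTorch sketch assuming symmetric per-output-channel scales, with dequantization done just before the matmul. This is only an illustrative approximation of the general idea; the actual ExpertsInt8 technique released with the models performs the dequantization inside a fused kernel rather than in plain PyTorch as shown here.

import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.
    Illustrative only; not the ExpertsInt8 kernel released with Jamba-1.5."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output row
    scale = scale.clamp(min=1e-8)                      # guard against all-zero rows
    w_int8 = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_int8, scale

def int8_linear(x: torch.Tensor, w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantize the INT8 weights to the activation dtype right before the
    matmul. A fused kernel would instead do this inside the GEMM so the
    full-precision weights are never materialized in GPU memory."""
    w = w_int8.to(x.dtype) * scale.to(x.dtype)
    return x @ w.t()

# Example: a hypothetical expert projection stored in INT8 (half the bytes of bf16).
w = torch.randn(4096, 4096)                      # [out_features, in_features]
w_q, s = quantize_int8_per_channel(w)
x = torch.randn(2, 4096, dtype=torch.bfloat16)   # a small batch of activations
y = int8_linear(x, w_q, s)                       # [2, 4096] output in bf16

Because the quantized weights are kept in INT8 and only converted at matmul time, weight memory is halved relative to bf16 storage, which is the kind of saving that makes it possible to serve a very large model on a single 8-GPU node.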