Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

June 11, 2024
作者: Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
cs.AI

Abstract

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either quadratic computation complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K-length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token prediction up to 1M context length. As a linear-time sequence model, Samba achieves 3.73x higher throughput than Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.
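
The abstract describes interleaving Mamba layers with Sliding Window Attention layers. The snippet below is a minimal, hypothetical sketch of that layer-wise hybrid pattern in PyTorch: `SimpleSSM` is a stand-in gated linear recurrence rather than the actual Mamba kernel, and the window size, model width, and interleaving order are illustrative assumptions, not details taken from the paper (see the official repository linked above for the real implementation).

```python
# Hypothetical sketch of a Mamba + Sliding Window Attention hybrid block.
# SimpleSSM is a simplified gated linear recurrence, NOT the real Mamba kernel.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Placeholder for a selective SSM: an input-gated elementwise linear recurrence."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        decay = torch.sigmoid(gate)            # input-dependent ("selective") decay
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):             # recurrent scan over the sequence
            h = decay[:, t] * h + (1 - decay[:, t]) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to a fixed local window."""
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):
        T = x.size(1)
        i = torch.arange(T)
        # True entries are masked out: future positions and positions beyond the window.
        mask = (i[None, :] > i[:, None]) | (i[:, None] - i[None, :] >= self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class HybridBlock(nn.Module):
    """One SSM layer followed by one SWA layer, each with a pre-norm residual path."""
    def __init__(self, d_model=256, n_heads=4, window=128):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.swa = SlidingWindowAttention(d_model, n_heads, window)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))        # compress long-range context into recurrent state
        x = x + self.swa(self.norm2(x))        # precise local recall via windowed attention
        return x


x = torch.randn(2, 512, 256)                   # (batch, seq_len, d_model)
print(HybridBlock()(x).shape)                  # torch.Size([2, 512, 256])
```

In this sketch the recurrent layer carries information across arbitrarily long distances in constant state, while the windowed attention layer handles exact retrieval within the local window, mirroring the division of labor the abstract attributes to Mamba and SWA.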