

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

June 11, 2024
作者: Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
cs.AI

Abstract

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer either from quadratic computation complexity or from limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA) layer by layer. Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters trained on 3.2T tokens and show that it substantially outperforms state-of-the-art models based on pure attention or pure SSMs on a wide range of benchmarks. When trained on 4K-length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall, and it shows improved token prediction up to 1M context length. As a linear-time sequence model, Samba achieves 3.73x higher throughput than Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.
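The abstract's core idea is the layer-wise interleaving of an SSM layer (which compresses history into a recurrent state in linear time) with a sliding-window attention layer (which gives exact recall over recent tokens). The PyTorch sketch below illustrates one plausible reading of that stacking pattern; it is not the authors' implementation. `SimpleSSM` is a toy gated linear recurrence standing in for the real selective-scan Mamba kernel, `SlidingWindowAttention` uses plain masked softmax attention rather than an optimized windowed kernel, and the MLP-between-layers ordering, module names, dimensions, and window size are all illustrative assumptions.

```python
# Minimal sketch of Samba-style layer-wise hybrid stacking (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSSM(nn.Module):
    """Stand-in for Mamba: a gated linear recurrence over the sequence."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel forget-gate logit
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                     # (D,)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):                        # O(T) recurrent scan
            h = a * h + (1 - a) * u[:, t]                 # history compressed into h
            outs.append(h)
        y = torch.stack(outs, dim=1) * F.silu(gate)
        return self.out_proj(y)

class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to the last `window` tokens."""
    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        i = torch.arange(T, device=x.device).unsqueeze(1)
        j = torch.arange(T, device=x.device).unsqueeze(0)
        # True = masked out: future tokens, and tokens beyond the local window
        mask = (j > i) | (j <= i - self.window)
        y, _ = self.attn(x, x, x, attn_mask=mask)
        return y

class SambaBlock(nn.Module):
    """One hybrid unit: SSM layer, then SWA layer, each followed by an MLP,
    all with pre-norm residual connections (MLP interleaving is an assumption)."""
    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.layers = nn.ModuleList([
            SimpleSSM(d_model), mlp(),
            SlidingWindowAttention(d_model, n_heads, window), mlp(),
        ])
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in self.layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))
        return x

# Usage: stack N such blocks to form the model.
block = SambaBlock(d_model=64, n_heads=4, window=16)
y = block(torch.randn(2, 32, 64))  # (batch=2, seq=32, dim=64)
print(y.shape)
```

The division of labor this sketch mirrors is what the abstract credits for length extrapolation: the recurrence carries arbitrarily long context in linear time with constant state size, while the windowed attention restores precise recall within the local window, so neither component's cost grows quadratically with sequence length.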
