MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
March 6, 2026
Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
cs.AI
Abstract
Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in
the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically
limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage
methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly
increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These
bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory
capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory
model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both
training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens.
Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory
Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs,
state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory
capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
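The abstract does not specify how document-wise RoPE is implemented; one plausible reading is that the rotary positional index restarts at each document boundary, so every memory segment sees local positions regardless of where it sits in the 100M-token stream. The following is a minimal illustrative sketch under that assumption; the function names (`document_wise_positions`, `apply_rope`) are hypothetical and not taken from the paper.

```python
import math

def document_wise_positions(doc_lengths):
    # Restart the positional index at the start of every document
    # (assumed interpretation of "document-wise RoPE"): a stream of
    # documents with lengths [3, 2] yields positions [0, 1, 2, 0, 1]
    # rather than global positions [0, 1, 2, 3, 4].
    positions = []
    for n in doc_lengths:
        positions.extend(range(n))
    return positions

def apply_rope(x, pos, base=10000.0):
    # Standard RoPE rotation: rotate consecutive feature pairs of x
    # by position-dependent angles theta_i = pos / base^(2i/dim).
    dim = len(x)
    out = []
    for i in range(dim // 2):
        theta = pos / (base ** (2 * i / dim))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out
```

Under this scheme, positions stay bounded by the longest single document rather than growing with total context, which is consistent with the claimed stability when scaling from 16K to 100M tokens.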