MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
March 6, 2026
Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
cs.AI
Abstract
Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in
the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically
limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage
methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly
increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These
bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory
capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory
model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both
training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens.
Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory
Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs,
state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory
capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
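The abstract does not detail MSA's sparse-attention mechanism, but the core idea behind "scalable sparse attention" in general is that each query attends to a fixed number of selected keys rather than the full context, keeping per-query cost constant as memory grows. The sketch below is an illustrative top-k variant, not the paper's actual algorithm; the function name and the dense top-k selection are assumptions for clarity.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Illustrative sketch: each query attends only to its top_k
    highest-scoring keys, so attention cost per query is O(top_k)
    rather than O(memory size). (Hypothetical; not MSA's exact method.)"""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (nq, nk) scaled dot products
    # Select the top_k key indices per query (unordered within the top set).
    idx = np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]
    rows = np.arange(q.shape[0])[:, None]
    sel = scores[rows, idx]                            # (nq, top_k) selected scores
    # Softmax over the selected keys only.
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Weighted sum of the corresponding values.
    return np.einsum('qk,qkd->qd', w, v[idx])          # (nq, d)
```

When `top_k` equals the total number of keys, this reduces exactly to full softmax attention, which is a convenient sanity check; the efficiency claim comes from holding `top_k` fixed while the key/value memory grows.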
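"Document-wise RoPE" is likewise not specified in the abstract. One plausible reading, sketched below purely as an assumption, is that rotary position indices restart at each document boundary, so rotary phases stay bounded regardless of how many documents fill the 100M-token memory. Both function names here are hypothetical illustrations.

```python
import numpy as np

def document_wise_positions(doc_lengths):
    # Hypothetical: restart position indices at every document boundary,
    # so no token's rotary phase depends on total context length.
    return np.concatenate([np.arange(n) for n in doc_lengths])

def rope_rotate(x, positions, base=10000.0):
    # Standard rotary embedding applied to a (seq, dim) array
    # using the supplied per-token positions.
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Under this reading, a token at local offset 0 of the thousandth document receives the same rotation as a token at offset 0 of the first, which is one way an architecture could stay stable when scaling from 16K to 100M tokens.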