MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Abstract

Long-term memory is a cornerstone of human intelligence, and enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Because of the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe accuracy degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios such as large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression combined with Memory Parallel enables 100M-token inference on two A800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents on long-context benchmarks. These results demonstrate that, by decoupling memory capacity from reasoning, MSA provides a scalable foundation for endowing general-purpose models with intrinsic, lifetime-scale memory.
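The abstract names scalable sparse attention and document-wise RoPE without describing their mechanics. The toy sketch below shows one plausible reading and is not the authors' implementation. It assumes that position ids restart at zero inside each stored document, and that sparse attention keeps only the top-k documents ranked by a dot product between the query and a per-document summary key. The function names `document_wise_positions` and `select_top_k_documents`, the summary keys, and the dot-product scoring are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def document_wise_positions(doc_lengths: Sequence[int]) -> List[int]:
    """Position ids that restart at 0 for every document.

    Assumption: this is only one possible interpretation of
    'document-wise RoPE'; the abstract does not define it.
    """
    positions: List[int] = []
    for length in doc_lengths:
        positions.extend(range(length))
    return positions


def select_top_k_documents(query: Sequence[float],
                           doc_keys: Sequence[Sequence[float]],
                           k: int) -> List[int]:
    """Indices of the k documents whose (hypothetical) summary key
    has the highest dot product with the query, in ascending order."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in doc_keys]
    ranked = sorted(range(len(doc_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three stored documents of 3, 2, and 4 tokens.
doc_lengths = [3, 2, 4]
print(document_wise_positions(doc_lengths))

# One summary key per document; the query attends to only 2 of the 3.
doc_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(select_top_k_documents([1.0, 0.2], doc_keys, k=2))
```

Under these assumptions, the largest position id is bounded by the longest document rather than by the total memory size, and the number of tokens attended per query depends on k rather than on the number of stored documents.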

