FlashMemory-DeepSeek-V4: Lightning-Indexed Ultra-Long Context via Lookahead Sparse Attention

Abstract

Conventional LLMs keep the full KV cache resident in GPU memory during decoding, which creates a severe memory bottleneck when serving ultra-long contexts. In this report, we propose Lookahead Sparse Attention (LSA), an inference paradigm built on the DeepSeek-V4 architecture and driven by a Neural Memory Indexer. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and keeps only the query-critical KV chunks in GPU memory. We instantiate this architecture with a backbone-free, decoupled training strategy: because the indexer is formulated as a standard dual encoder, it can be trained independently with standard retrieval training frameworks, without ever loading the massive backbone model into GPU memory. We show that this "less is more" paradigm substantially improves serving efficiency while acting as an effective attention denoiser on tasks that rely on long-term global memory. Across major long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint to 13.5% of the full-context baseline while consistently preserving or slightly improving downstream accuracy (+0.6% absolute on average). At the extreme 500K scale, FlashMemory reduces the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capabilities.
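
The abstract does not say how the indexer scores KV chunks or how many it retains, so the following is only a minimal sketch of the general idea of dual-encoder chunk selection, not the authors' implementation. It assumes that a query embedding and one embedding per KV chunk are already available, that relevance is an inner product, and that a fixed fraction of chunks is kept. The names `select_chunks` and `keep_ratio`, and all embedding values, are hypothetical.

```python
import heapq
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_chunks(
    query_emb: Sequence[float],
    chunk_embs: Sequence[Sequence[float]],
    keep_ratio: float,
) -> List[int]:
    """Return the indices of the KV chunks to keep, in positional order.

    Assumption (not from the report): each chunk is scored by the inner
    product between its embedding and the query embedding, and the top
    ceil(keep_ratio * n) chunks are retained.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(chunk_embs)
    k = max(1, math.ceil(n * keep_ratio))
    scores = [dot(query_emb, c) for c in chunk_embs]
    # Pick the k highest-scoring chunk indices, then restore positional order.
    top = heapq.nlargest(k, range(n), key=scores.__getitem__)
    return sorted(top)


# Toy example: 8 chunks with hypothetical 2-d embeddings, keep 25% of them.
query = [1.0, 0.0]
chunks = [
    [0.1, 0.9], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1],
    [0.2, 0.3], [0.3, 0.3], [-0.5, 0.5], [0.4, 0.0],
]
kept = select_chunks(query, chunks, keep_ratio=0.25)
print(kept)  # [1, 3]
```

Because the scoring in this sketch depends only on the two embeddings, such a selector could in principle be trained and evaluated without the backbone model in memory, which is the property the abstract attributes to the decoupled training strategy.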