FlashMemory-DeepSeek-V4: Lightning-Indexed Ultra-Long Context via Lookahead Sparse Attention

Abstract

Conventional LLMs keep the full KV cache loaded during decoding, which creates a severe GPU memory bottleneck when serving ultra-long contexts. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm driven by a Neural Memory Indexer built on the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and keeps only the query-critical KV chunks resident in GPU memory. A key innovation is that we instantiate this architecture through a backbone-free, decoupled training strategy: we formulate the indexer as a standard dual-encoder and train it independently with standard retrieval training frameworks, without ever loading the massive backbone model into GPU memory. We show that this "less is more" paradigm substantially improves serving efficiency while also acting as an effective attention denoiser on tasks that rely on long-term global memory. Across the primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FlashMemory-DeepSeek-V4 (FM-DS-V4) compresses the average physical KV cache footprint to just 13.5% of the full-context baseline while consistently preserving or slightly improving downstream accuracy (an average absolute gain of +0.6%). Notably, at the extreme 500K scale, FlashMemory reduces the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capabilities.
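
The chunk-selection idea described above can be illustrated with a minimal sketch. This is not the report's implementation: the function name `select_resident_chunks`, the dot-product scoring rule, the `budget_fraction` parameter, and the toy two-dimensional embeddings are all assumptions made for illustration. The sketch assumes a dual-encoder has already produced one embedding per KV chunk and one embedding for the predicted query, and it keeps only the highest-scoring chunks within a fixed residency budget.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_resident_chunks(query_vec, chunk_vecs, budget_fraction):
    """Return sorted indices of the KV chunks to keep on the GPU.

    Hypothetical helper: scores each chunk embedding against the query
    embedding (dot product, an assumed similarity) and keeps the top
    floor(n * budget_fraction) chunks, with a minimum of one.
    """
    n = len(chunk_vecs)
    keep = max(1, math.floor(n * budget_fraction))
    scores = [dot(query_vec, c) for c in chunk_vecs]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy embeddings standing in for the outputs of the two encoders.
chunk_vecs = [
    [1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0],
    [0.9, 0.1], [0.0, -1.0], [0.2, 0.3], [-0.5, 0.5],
]
query_vec = [1.0, 0.2]

resident = select_resident_chunks(query_vec, chunk_vecs, budget_fraction=0.25)
resident_fraction = len(resident) / len(chunk_vecs)
print(resident, resident_fraction)
```

In this toy setting, a budget of 0.25 keeps two of the eight chunks resident. The reported 13.5% average footprint is a measured outcome across benchmarks, so it should not be read as a fixed budget parameter of this kind.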