FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Abstract

Conventional LLMs keep the full KV cache resident in memory during decoding, which creates a severe GPU memory bottleneck for ultra-long-context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and keeps only the query-critical KV chunks in GPU memory. We instantiate this architecture via a backbone-free decoupled training strategy: by formulating the indexer as a standard dual-encoder architecture, we train it independently with standard retrieval training frameworks, without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm substantially improves serving efficiency and also acts as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the physical KV cache footprint to 13.5% of the full-context baseline on average, while consistently preserving or slightly improving downstream accuracy (+0.6% absolute on average). Notably, at the extreme 500K scale, FlashMemory reduces the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.
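
To make the chunk-selection idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the indexer has already produced one embedding for the query and one per KV chunk (as a dual encoder would), scores chunks by dot product, and keeps a fixed fraction of them. The function name `select_chunks`, the `keep_ratio` parameter, and the dot-product scoring rule are all hypothetical; the abstract does not specify how lookahead prediction or chunk budgeting is actually done.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_chunks(
    query_emb: Sequence[float],
    chunk_embs: Sequence[Sequence[float]],
    keep_ratio: float = 0.135,  # hypothetical budget; 13.5% echoes the reported average footprint
) -> List[int]:
    """Return indices of the KV chunks to keep on the GPU (illustrative only).

    Chunks are ranked by dual-encoder similarity to the query embedding,
    and the top ceil(keep_ratio * n) are retained, with at least one kept.
    """
    if not chunk_embs:
        return []
    k = max(1, math.ceil(keep_ratio * len(chunk_embs)))
    scores = [dot(query_emb, c) for c in chunk_embs]
    ranked = sorted(range(len(chunk_embs)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in positional order so the chunk layout is preserved.
    return sorted(ranked[:k])


if __name__ == "__main__":
    query = [1.0, 0.0]
    chunks = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
    print(select_chunks(query, chunks, keep_ratio=0.5))
```

In this toy setup only the selected chunk indices would stay resident; everything else could be offloaded or dropped, which is the sense in which a small retained fraction can bound the physical KV cache footprint.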