FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

March 6, 2026
Authors: Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He
cs.AI

Abstract

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
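
The abstract sketches two components: a fast block search over vertical, slash, and block-sparse patterns, and a dynamic thresholding rule that prunes low-score blocks without sorting or accumulating attention weights. The paper's actual algorithm is not spelled out here, so the following is only a minimal, hypothetical PyTorch sketch of the thresholding idea: block-pooled scores are kept when they fall within a factor `alpha` of the per-row maximum, a comparison that needs one `max` reduction rather than a sort or cumulative top-p pass. The function name, the mean-pooling step, and `alpha` are all assumptions, not details from the paper.

```python
import math
import torch

def select_attention_blocks(q, k, block_size=64, alpha=0.1):
    """Hypothetical block selection via dynamic thresholding.

    q, k: (seq_len, head_dim) tensors for one head; seq_len is assumed
    to be divisible by block_size. Returns a boolean
    (num_blocks, num_blocks) mask of block pairs to compute exactly.
    """
    seq_len, dim = q.shape
    nb = seq_len // block_size

    # Mean-pool each block to a single representative vector, giving a
    # coarse (nb, nb) estimate of block-level attention scores.
    qb = q.view(nb, block_size, dim).mean(dim=1)
    kb = k.view(nb, block_size, dim).mean(dim=1)
    scores = (qb @ kb.T) / math.sqrt(dim)

    # Block-level causal mask: a query block only attends to key blocks
    # at or before its own position.
    causal = torch.tril(torch.ones(nb, nb, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Dynamic threshold: keep a block iff its unnormalized weight is
    # within a factor alpha of the row maximum, i.e.
    # exp(score) / exp(row_max) >= alpha  <=>  score >= row_max + log(alpha).
    # This trims the long tail with a single max reduction -- no sorting,
    # no accumulation of normalized attention scores.
    row_max = scores.max(dim=-1, keepdim=True).values
    return scores >= row_max + math.log(alpha)
```

Raising `alpha` prunes more aggressively (higher sparsity, larger potential speedup); lowering it retains more of the tail. A max-relative cutoff like this runs in a single pass over the pooled scores, which is consistent with the abstract's claim of avoiding sorting and accumulation overhead.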