FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

March 6, 2026
Authors: Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He
cs.AI

Abstract

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
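
The abstract sketches two components: a fast block search over vertical, slash, and block-sparse patterns, and a dynamic thresholding rule that prunes low-score blocks without sorting or accumulating attention weights. The paper's actual algorithm is not spelled out here, so the following is only a minimal, hypothetical PyTorch sketch of the thresholding idea: block-pooled scores are kept when they fall within a factor `alpha` of the per-row maximum, a comparison that needs one `max` reduction rather than a sort or cumulative top-p pass. The function name, the mean-pooling step, and `alpha` are all assumptions, not details from the paper.

```python
import math
import torch

def select_attention_blocks(q, k, block_size=64, alpha=0.1):
    """Hypothetical block selection via dynamic thresholding.

    q, k: (seq_len, head_dim) tensors for one head; seq_len is assumed
    to be divisible by block_size. Returns a boolean
    (num_blocks, num_blocks) mask of block pairs to compute exactly.
    """
    seq_len, dim = q.shape
    nb = seq_len // block_size

    # Mean-pool each block to a single representative vector, giving a
    # coarse (nb, nb) estimate of block-level attention scores.
    qb = q.view(nb, block_size, dim).mean(dim=1)
    kb = k.view(nb, block_size, dim).mean(dim=1)
    scores = (qb @ kb.T) / math.sqrt(dim)

    # Block-level causal mask: a query block only attends to key blocks
    # at or before its own position.
    causal = torch.tril(torch.ones(nb, nb, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Dynamic threshold: keep a block iff its unnormalized weight is
    # within a factor alpha of the row maximum, i.e.
    # exp(score) / exp(row_max) >= alpha  <=>  score >= row_max + log(alpha).
    # This trims the long tail with a single max reduction -- no sorting,
    # no accumulation of normalized attention scores.
    row_max = scores.max(dim=-1, keepdim=True).values
    return scores >= row_max + math.log(alpha)
```

Raising `alpha` prunes more aggressively (higher sparsity, larger potential speedup); lowering it retains more of the tail. A max-relative cutoff like this runs in a single pass over the pooled scores, which is consistent with the abstract's claim of avoiding sorting and accumulation overhead.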