FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Abstract

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework that enables ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill uses a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that avoids the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to increase sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78× speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71× speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
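
To make the idea of thresholding without sorting or accumulation concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the mean-pooled block representatives, the function name `block_mask_by_threshold`, and the parameters `block` and `alpha` are all assumptions chosen for illustration. The sketch keeps a key block for a query block when its estimated score is within a fixed ratio of that row's maximum, which needs only one max reduction per row instead of a sort or a cumulative sum.

```python
import numpy as np


def block_mask_by_threshold(q, k, block=64, alpha=0.1):
    """Return a (num_blocks, num_blocks) boolean mask of key blocks to keep.

    q, k: arrays of shape (seq_len, head_dim); seq_len must be divisible
    by `block`. Hypothetical helper, not taken from the FlashPrefill paper.
    """
    n, d = q.shape
    assert n % block == 0, "seq_len must be a multiple of the block size"
    nb = n // block

    # Assumption: summarize each block by the mean of its vectors.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)

    # Coarse block-level attention logits.
    scores = qb @ kb.T / np.sqrt(d)

    # Causal masking at block granularity: a query block only sees
    # key blocks at or before its own position.
    causal = np.tril(np.ones((nb, nb), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    # Max-relative threshold: keep a block if exp(score - row_max) >= alpha,
    # i.e. score >= row_max + log(alpha). No sorting, no cumulative sum.
    row_max = scores.max(axis=1, keepdims=True)
    return scores >= row_max + np.log(alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((256, 16))
    k = rng.standard_normal((256, 16))
    mask = block_mask_by_threshold(q, k, block=64, alpha=0.5)
    kept = mask.sum() / np.tril(np.ones_like(mask)).sum()
    print(f"fraction of causal blocks kept: {kept:.2f}")
```

Raising `alpha` toward 1 prunes more of the low-scoring tail, while a very small `alpha` falls back to dense causal attention at block granularity. A real kernel would then compute exact attention only over the kept blocks.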
