FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Abstract

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
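The contrast between sort-and-accumulate block selection and threshold-based selection can be illustrated with a small sketch. The abstract does not state the actual thresholding rule, so everything below is an assumption made for illustration: the function names, the parameter `alpha`, the toy block scores, and the rule itself (keep a key block when its unnormalized weight is at least `alpha` times that of the strongest block). It is not FlashPrefill's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of block-level scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_by_cumulative_mass(probs, mass):
    """Baseline: sort blocks by probability and keep the smallest prefix
    whose accumulated probability reaches `mass`. Needs a sort and a
    running sum."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += probs[i]
        if acc >= mass:
            break
    return sorted(kept)


def select_by_relative_threshold(scores, alpha):
    """Hypothetical rule (an assumption, not taken from the paper): keep
    block j when exp(s_j - max_s) >= alpha. Needs only a max and one
    comparison per block, with no sort and no accumulation."""
    cutoff = max(scores) + math.log(alpha)
    return [j for j, s in enumerate(scores) if s >= cutoff]


# Toy block scores for one query block: two strong blocks and a long tail.
block_scores = [8.0, 7.5, 2.0, 1.0, 0.5, 0.2, 0.1, 0.0]
probs = softmax(block_scores)

kept_threshold = select_by_relative_threshold(block_scores, alpha=0.05)
kept_mass_95 = select_by_cumulative_mass(probs, mass=0.95)
kept_mass_999 = select_by_cumulative_mass(probs, mass=0.999)

print(kept_threshold)   # [0, 1]
print(kept_mass_95)     # [0, 1]
print(len(kept_mass_999))
```

In this toy case a strict cumulative-mass target pulls in tail blocks that each contribute very little, while the max-relative cutoff drops them in a single pass. Whether the paper's mechanism behaves this way depends on details the abstract does not give.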
