

UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

May 7, 2026
Authors: Qihang Fan, Huaibo Huang, Zhiying Wu, Bingning Wang, Ran He
cs.AI

Abstract

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures, such as linear/full-attention hybrids or sliding-window/full-attention hybrids, these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous-batching operator and extend vLLM's scheduling strategy to natively support prefill-decode co-processing and tensor parallelism for UniPrefill, enabling its seamless integration into vLLM. UniPrefill achieves up to a 2.1x speedup in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.
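
The abstract does not spell out UniPrefill's algorithm, but the title's "block-wise dynamic sparsification" at the token level suggests scoring contiguous token blocks during prefill and pruning low-importance ones before the expensive per-layer computation. The sketch below is a minimal illustration of that general idea only, assuming a norm-based importance proxy; the block size, keep ratio, and the function name `sparsify_prefill_tokens` are hypothetical and not taken from the paper.

```python
# A minimal sketch of block-wise dynamic token sparsification during prefill.
# The scoring rule (mean hidden-state L2 norm) and top-k block selection are
# stand-in assumptions, NOT UniPrefill's actual method.
import torch

def sparsify_prefill_tokens(hidden, block_size=128, keep_ratio=0.5):
    """hidden: [seq_len, dim]. Returns kept token states and their indices."""
    seq_len, _ = hidden.shape
    n_blocks = (seq_len + block_size - 1) // block_size

    # Score each contiguous token block with a cheap importance proxy.
    scores = torch.stack([
        hidden[i * block_size:(i + 1) * block_size].norm(dim=-1).mean()
        for i in range(n_blocks)
    ])

    # Keep the highest-scoring fraction of blocks; always keep the last
    # block, since its tokens produce the first decoded token.
    k = max(1, int(keep_ratio * n_blocks))
    keep = set(scores.topk(k).indices.tolist()) | {n_blocks - 1}

    idx = torch.cat([
        torch.arange(i * block_size, min((i + 1) * block_size, seq_len))
        for i in sorted(keep)
    ])
    return hidden[idx], idx
```

In such a scheme, downstream layers would run only on the kept tokens, with `idx` available to map outputs back to their original sequence positions; because the operation is per-request and token-indexed, it composes naturally with continuous batching.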