UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

May 7, 2026
Authors: Qihang Fan, Huaibo Huang, Zhiying Wu, Bingning Wang, Ran He
cs.AI

Abstract

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures, such as linear/full-attention hybrids or sliding-window/full-attention hybrids, these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous-batching operator and extend vLLM's scheduling strategy to natively support prefill-decode co-processing and tensor parallelism for UniPrefill, enabling seamless integration into vLLM. UniPrefill achieves up to a 2.1x speedup in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.
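The abstract does not spell out UniPrefill's block-selection rule, but the general shape of block-wise dynamic sparsification during prefill can be sketched. The snippet below is a minimal illustration only: the mean-pooled query-key scoring and the names `select_prefill_blocks`, `block_size`, and `keep_ratio` are assumptions for exposition, not the paper's actual algorithm.

```python
# Hypothetical sketch of block-wise dynamic sparsification for prefill.
# Assumption: each key block is scored by its similarity to a pooled query,
# and only the top-scoring blocks participate in the attention computation.
import torch


def select_prefill_blocks(q, k, block_size=128, keep_ratio=0.5):
    """Score key blocks against pooled queries; keep the top fraction.

    q, k: (seq_len, num_heads, head_dim) query/key tensors for one request.
    Returns the indices of key blocks retained for the sparse prefill pass.
    """
    seq_len = k.shape[0]
    num_blocks = (seq_len + block_size - 1) // block_size
    # Pad the sequence dimension so it splits evenly into blocks.
    pad = num_blocks * block_size - seq_len
    k_pad = torch.nn.functional.pad(k, (0, 0, 0, 0, 0, pad))
    # Mean-pool keys within each block: (num_blocks, num_heads, head_dim).
    k_blocks = k_pad.view(num_blocks, block_size, *k.shape[1:]).mean(dim=1)
    # Pool queries over the sequence: (num_heads, head_dim).
    q_pool = q.mean(dim=0)
    # One importance score per block, summed over heads.
    scores = torch.einsum("hd,bhd->b", q_pool, k_blocks)
    keep = max(1, int(num_blocks * keep_ratio))
    # Sort the kept indices so blocks stay in causal order.
    return torch.topk(scores, keep).indices.sort().values
```

In a full prefill kernel, attention would then be computed only against the retained key/value blocks; how UniPrefill actually scores blocks, handles hybrid layers, and schedules this inside vLLM's continuous batching is detailed in the paper itself.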