POP: Prefill-Only Pruning for Efficient Large Model Inference
February 3, 2026
Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li
cs.AI
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities, but their deployment is hindered by substantial computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from stage-agnostic pruning, which overlooks the asymmetric roles of the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections that maintain cache integrity, and a boundary handling strategy that ensures the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to a 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off of existing structured pruning methods.
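To make the stage-aware idea concrete, below is a minimal sketch of how prefill-only layer skipping could be wired up. Everything here is an assumption for illustration, not the authors' implementation: the module names (ToyLayer, PopModel, aux_kv), the placeholder block (a linear mix instead of real attention), the number of skipped layers, and the specific boundary rule (running the last prompt token through the full stack so the first generated token comes from the unpruned model).

```python
# Minimal sketch of POP-style stage-aware inference, under assumed names and
# shapes. Deep layers are skipped in prefill; separate ("independent") KV
# projections fill the skipped layers' cache slots; the last prompt position
# is re-run through the full stack as a boundary-handling step.
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Stand-in for a transformer block with its own KV projections."""
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)     # placeholder for attention + MLP
        self.k_proj = nn.Linear(d, d)  # regular key projection
        self.v_proj = nn.Linear(d, d)  # regular value projection

    def forward(self, x, kv_cache):
        kv_cache.append((self.k_proj(x), self.v_proj(x)))
        return x + torch.tanh(self.mix(x))

class PopModel(nn.Module):
    def __init__(self, d=64, n_layers=8, n_deep=3):
        super().__init__()
        self.layers = nn.ModuleList(ToyLayer(d) for _ in range(n_layers))
        self.n_shallow = n_layers - n_deep
        # Independent KV projections (assumed form): one (K, V) pair per
        # skipped deep layer, applied to the shallow-stack output so the deep
        # layers' caches stay populated even though those blocks never run
        # during prefill.
        self.aux_kv = nn.ModuleList(
            nn.ModuleDict({"k": nn.Linear(d, d), "v": nn.Linear(d, d)})
            for _ in range(n_deep)
        )

    def prefill(self, x):
        """Prefill: run shallow layers only; fill deep caches via aux KV."""
        cache, h = [], x
        for layer in self.layers[: self.n_shallow]:
            h = layer(h, cache)
        for proj in self.aux_kv:  # cache entries for the skipped deep layers
            cache.append((proj["k"](h), proj["v"](h)))
        return h, cache

    def decode_step(self, x_tok, cache):
        """Decode: the full stack runs on each new token. The prefill cache is
        carried along but unused here, since this toy block has no real
        attention over cached keys/values."""
        h, step_cache = x_tok, []
        for layer in self.layers:
            h = layer(h, step_cache)
        return h

    def generate(self, prompt, n_new=4):
        # Boundary handling (assumed form): prefill everything except the last
        # prompt position, then push that position through the full stack so
        # the first generated token is produced by the unpruned model.
        _, cache = self.prefill(prompt[:, :-1, :])
        h = self.decode_step(prompt[:, -1:, :], cache)
        outs = [h]
        for _ in range(n_new - 1):
            h = self.decode_step(h, cache)
            outs.append(h)
        return torch.cat(outs, dim=1)

if __name__ == "__main__":
    model = PopModel()
    prompt = torch.randn(1, 16, 64)  # (batch, seq, dim) dummy prompt
    print(model.generate(prompt).shape)  # torch.Size([1, 4, 64])
```

The sketch shows why the speedup concentrates in prefill: the long prompt pass touches only the shallow layers plus a handful of cheap linear projections, while per-token decode cost is unchanged because the full model still runs there.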