POP: Prefill-Only Pruning for Efficient Large Model Inference
February 3, 2026
Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li
cs.AI
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities, but their deployment is hindered by high computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from stage-agnostic pruning, which overlooks the asymmetric roles of the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable a seamless transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, together with a boundary handling strategy that ensures the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to a 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off of existing structured pruning methods.
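As a rough illustration of the stage-aware idea described in the abstract (not the authors' implementation), the sketch below runs only the shallow layers during prefill, fills the KV cache entries of the skipped deep layers with separate, hypothetical KV projections, and then uses the full layer stack for decoding. All names and sizes here (Block, skip_kv, PREFILL_DEPTH, the toy mixing layer standing in for attention) are assumptions made for the example, and the paper's boundary handling for the first generated token is omitted.

```python
# Minimal sketch of stage-aware (prefill-only pruned) inference.
# Hypothetical names and toy layers; attention is replaced by a residual mix.
import torch
import torch.nn as nn

D, N_LAYERS, PREFILL_DEPTH = 64, 8, 5   # toy sizes: deep layers 5..7 are skipped at prefill

class Block(nn.Module):
    """Toy transformer block: per-layer K/V projections plus a residual feed-forward mix."""
    def __init__(self):
        super().__init__()
        self.k_proj, self.v_proj, self.mix = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)

    def forward(self, x):
        # Return updated hidden states and this layer's KV entries for the cache.
        return x + torch.tanh(self.mix(x)), (self.k_proj(x), self.v_proj(x))

layers = nn.ModuleList(Block() for _ in range(N_LAYERS))

# Independent KV projections standing in for the deep layers skipped at prefill,
# so the cache still holds an entry for every layer (assumed mechanism).
skip_kv = nn.ModuleList(
    nn.ModuleDict({"k": nn.Linear(D, D), "v": nn.Linear(D, D)})
    for _ in range(N_LAYERS - PREFILL_DEPTH)
)

def prefill(prompt_states):
    cache, h = [None] * N_LAYERS, prompt_states
    for i in range(PREFILL_DEPTH):                 # shallow layers only
        h, cache[i] = layers[i](h)
    for j, proj in enumerate(skip_kv):             # populate KV for the pruned deep layers
        cache[PREFILL_DEPTH + j] = (proj["k"](h), proj["v"](h))
    return h, cache

def decode_step(token_state, cache):
    h = token_state
    for i, layer in enumerate(layers):             # full model during decode
        h, kv = layer(h)
        cache[i] = tuple(torch.cat([old, new], dim=1) for old, new in zip(cache[i], kv))
    return h

prompt = torch.randn(1, 16, D)                     # 16 prompt tokens
h, cache = prefill(prompt)
out = decode_step(torch.randn(1, 1, D), cache)     # one decode step with the full stack
print(out.shape, [c[0].shape for c in cache[:2]])
```

Because the pruned deep layers are skipped only while the long prompt is encoded, the per-token decode path is unchanged; the speedup reported in the abstract therefore applies to prefill latency, not to decode throughput.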