ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
October 7, 2025
Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
cs.AI
Abstract
Large language models (LLMs) present significant deployment challenges due to
their immense computational and memory requirements. While semi-structured
pruning, particularly 2:4 sparsity, offers a path to practical hardware
acceleration, existing methods often incur substantial performance degradation.
To bridge this gap, we introduce ARMOR (Adaptive Representation with
Matrix-factORization), a novel one-shot post-training pruning algorithm.
Instead of directly pruning weights, ARMOR factorizes each weight matrix into a
2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These
wrappers act as efficient pre- and post-transformation error correctors,
offering greater flexibility to preserve model quality compared to conventional
2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen
through a block coordinate descent algorithm that minimizes a layer-wise proxy
loss. We theoretically prove this optimization is guaranteed to converge to a
solution with a proxy loss less than or equal to state-of-the-art pruning
algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and
Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and
significantly outperforms state-of-the-art 2:4 pruning methods across a wide
range of downstream tasks and perplexity evaluations. ARMOR achieves this
superior performance while retaining the inference speedups and substantial
memory usage reductions of 2:4 pruning, establishing a more effective trade-off
between model compression and task accuracy.
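
As a rough illustration of the structure the abstract describes, here is a minimal PyTorch sketch of a weight factorized as W ≈ A S B, where S carries a 2:4 mask and A, B are block diagonal, together with the layer-wise proxy loss that a block coordinate descent procedure would minimize. The function names, identity initialization, and magnitude-based mask are assumptions for illustration only; the paper's actual update rules and objective may differ.

```python
import torch

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every contiguous group of 4
    along the input dimension (the standard 2:4 semi-structured pattern)."""
    rows, cols = w.shape
    groups = w.abs().reshape(rows, cols // 4, 4)
    topk = groups.topk(2, dim=-1).indices          # top-2 positions per group of 4
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(rows, cols).bool()

def block_diag_wrapper(blocks: torch.Tensor) -> torch.Tensor:
    """Assemble a dense block-diagonal matrix from a stack of small square blocks."""
    return torch.block_diag(*blocks)

def proxy_loss(x: torch.Tensor, w: torch.Tensor, a: torch.Tensor,
               s: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Layer-wise reconstruction error || X W^T - X (A S B)^T ||_F^2 for a
    linear layer y = x W^T, with A, B block diagonal and S 2:4 sparse
    (assumed form of the proxy loss, not necessarily the paper's exact objective)."""
    return ((x @ w.T - x @ (a @ s @ b).T) ** 2).sum()

# Toy usage: an 8x16 weight and a calibration batch of 4 inputs.
torch.manual_seed(0)
w = torch.randn(8, 16)
x = torch.randn(4, 16)
s = w * two_four_mask(w)                                # 2:4 sparse core (magnitude-based here)
a = block_diag_wrapper(torch.eye(4).repeat(2, 1, 1))    # output-side wrapper, identity init
b = block_diag_wrapper(torch.eye(4).repeat(4, 1, 1))    # input-side wrapper, identity init
print(proxy_loss(x, w, a, s, b).item())
```

In this sketch the wrappers start as identities, so the initial proxy loss reduces to that of plain magnitude-based 2:4 pruning; alternating updates of A, S, and B could then only lower it, which is consistent with the abstract's claim that the converged proxy loss is no worse than that of existing pruning algorithms.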