ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
October 7, 2025
Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
cs.AI
Abstract
Large language models (LLMs) present significant deployment challenges due to
their immense computational and memory requirements. While semi-structured
pruning, particularly 2:4 sparsity, offers a path to practical hardware
acceleration, existing methods often incur substantial performance degradation.
To bridge this gap, we introduce ARMOR (Adaptive Representation with
Matrix-factORization), a novel one-shot post-training pruning algorithm.
Instead of directly pruning weights, ARMOR factorizes each weight matrix into a
2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These
wrappers act as efficient pre- and post-transformation error correctors,
offering greater flexibility to preserve model quality compared to conventional
2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen
through a block coordinate descent algorithm that minimizes a layer-wise proxy
loss. We theoretically prove this optimization is guaranteed to converge to a
solution with a proxy loss less than or equal to state-of-the-art pruning
algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and
Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and
significantly outperforms state-of-the-art 2:4 pruning methods across a wide
range of downstream tasks and perplexity evaluations. ARMOR achieves this
superior performance while retaining the inference speedups and substantial
memory usage reductions of 2:4 pruning, establishing a more effective trade-off
between model compression and task accuracy.
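
As a rough illustration of the structure the abstract describes, here is a minimal PyTorch sketch of a weight factorized as W ≈ A S B, where S carries a 2:4 mask and A, B are block diagonal, together with the layer-wise proxy loss that a block coordinate descent procedure would minimize. The function names, identity initialization, and magnitude-based mask are assumptions for illustration only; the paper's actual update rules and objective may differ.

```python
import torch

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every contiguous group of 4
    along the input dimension (the standard 2:4 semi-structured pattern)."""
    rows, cols = w.shape
    groups = w.abs().reshape(rows, cols // 4, 4)
    topk = groups.topk(2, dim=-1).indices          # top-2 positions per group of 4
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(rows, cols).bool()

def block_diag_wrapper(blocks: torch.Tensor) -> torch.Tensor:
    """Assemble a dense block-diagonal matrix from a stack of small square blocks."""
    return torch.block_diag(*blocks)

def proxy_loss(x: torch.Tensor, w: torch.Tensor, a: torch.Tensor,
               s: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Layer-wise reconstruction error || X W^T - X (A S B)^T ||_F^2 for a
    linear layer y = x W^T, with A, B block diagonal and S 2:4 sparse
    (assumed form of the proxy loss, not necessarily the paper's exact objective)."""
    return ((x @ w.T - x @ (a @ s @ b).T) ** 2).sum()

# Toy usage: an 8x16 weight and a calibration batch of 4 inputs.
torch.manual_seed(0)
w = torch.randn(8, 16)
x = torch.randn(4, 16)
s = w * two_four_mask(w)                                # 2:4 sparse core (magnitude-based here)
a = block_diag_wrapper(torch.eye(4).repeat(2, 1, 1))    # output-side wrapper, identity init
b = block_diag_wrapper(torch.eye(4).repeat(4, 1, 1))    # input-side wrapper, identity init
print(proxy_loss(x, w, a, s, b).item())
```

In this sketch the wrappers start as identities, so the initial proxy loss reduces to that of plain magnitude-based 2:4 pruning; alternating updates of A, S, and B could then only lower it, which is consistent with the abstract's claim that the converged proxy loss is no worse than that of existing pruning algorithms.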