

ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

October 7, 2025
Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
cs.AI

Abstract

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block-diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
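
The sketch below is a minimal, hypothetical illustration of the factorization structure the abstract describes: a weight matrix W is approximated as A @ S @ B, where S is 2:4 sparse and A, B are block-diagonal wrappers refined by block coordinate descent. It is not the authors' implementation; for brevity it minimizes the weight-space objective ||W - A S B||_F^2 rather than the paper's activation-based layer-wise proxy loss, and the block size, iteration count, update order, and helper names are assumptions.

```python
# Minimal sketch of the ARMOR-style factorization W ≈ A @ S @ B (assumptions noted above).
import torch

def project_2_4(M: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude entries in every group of 4 along the columns
    (column count must be divisible by 4)."""
    rows, cols = M.shape
    groups = M.reshape(rows, cols // 4, 4)
    drop = groups.abs().topk(2, dim=-1, largest=False).indices
    return groups.scatter(-1, drop, 0.0).reshape(rows, cols)

def refit_left_wrapper(W: torch.Tensor, M: torch.Tensor, block: int) -> torch.Tensor:
    """Exact least-squares update of the block-diagonal left wrapper:
    per row block k, A_k = argmin ||W_k - A_k M_k||_F."""
    A = torch.zeros(W.shape[0], W.shape[0])
    for s in range(0, W.shape[0], block):
        Wk, Mk = W[s:s + block], M[s:s + block]
        A[s:s + block, s:s + block] = Wk @ Mk.T @ torch.linalg.pinv(Mk @ Mk.T)
    return A

def refit_right_wrapper(W: torch.Tensor, N: torch.Tensor, block: int) -> torch.Tensor:
    """Exact least-squares update of the block-diagonal right wrapper:
    per column block k, B_k = argmin ||W_k - N_k B_k||_F."""
    B = torch.zeros(W.shape[1], W.shape[1])
    for s in range(0, W.shape[1], block):
        Wk, Nk = W[:, s:s + block], N[:, s:s + block]
        B[s:s + block, s:s + block] = torch.linalg.pinv(Nk.T @ Nk) @ Nk.T @ Wk
    return B

def armor_factorize(W: torch.Tensor, block: int = 16, iters: int = 5):
    """Block coordinate descent over (A, S, B) for the simplified proxy ||W - A S B||_F^2."""
    d_out, d_in = W.shape
    A, B = torch.eye(d_out), torch.eye(d_in)   # identity-initialized wrappers
    S = project_2_4(W)                          # start from magnitude-based 2:4 pruning
    for _ in range(iters):
        A = refit_left_wrapper(W, S @ B, block)   # exact coordinate step on A
        B = refit_right_wrapper(W, A @ S, block)  # exact coordinate step on B
        # Heuristic coordinate step on S: re-project the wrapper-corrected target onto 2:4 sparsity.
        S = project_2_4(torch.linalg.pinv(A) @ W @ torch.linalg.pinv(B))
    return A, S, B

# Usage: the dense weight is replaced by the triple (A, S, B); at inference,
# the cheap block-diagonal wrappers are applied around the hardware-accelerated 2:4 matmul.
W = torch.randn(64, 128)
A, S, B = armor_factorize(W)
print(torch.norm(W - A @ S @ B) / torch.norm(W))  # relative reconstruction error
```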