ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Abstract

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality than conventional 2:4 pruning techniques. The sparse core and block-diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We prove that this optimization is guaranteed to converge to a solution whose proxy loss is less than or equal to that of state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
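To make the factorized form concrete, the following NumPy sketch builds a 2:4 sparse core by magnitude, wraps it in two block-diagonal matrices initialized to the identity, and refits the left wrapper block by block with least squares. This is an illustration under stated assumptions, not the algorithm from the paper: the function names (`mask_2of4`, `block_diag`, `refit_left_wrapper`) are hypothetical, plain Frobenius error stands in for the layer-wise proxy loss, and only one coordinate block (the left wrapper) is updated once.

```python
import numpy as np


def mask_2of4(w):
    """Boolean mask keeping the 2 largest-magnitude entries in each group of 4 per row."""
    rows, cols = w.shape
    groups = np.abs(w).reshape(rows, cols // 4, 4)
    # argsort is ascending, so the last two indices are the two largest entries.
    top2 = np.argsort(groups, axis=-1)[..., 2:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top2, True, axis=-1)
    return mask.reshape(rows, cols)


def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out


def refit_left_wrapper(w, core, right, block):
    """Least-squares refit of each left-wrapper block, with core and right wrapper fixed.

    Assumption: plain Frobenius error is used here in place of the paper's proxy loss.
    """
    m = core @ right
    blocks = []
    for i in range(0, w.shape[0], block):
        # Minimize ||w_i - A_i m_i||_F by solving m_i^T A_i^T = w_i^T.
        a_t, *_ = np.linalg.lstsq(m[i:i + block].T, w[i:i + block].T, rcond=None)
        blocks.append(a_t.T)
    return block_diag(blocks)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))   # toy weight matrix
mask = mask_2of4(w)
core = w * mask                    # 2:4 sparse core
left = np.eye(8)                   # identity wrappers reproduce plain 2:4 pruning
right = np.eye(16)

err_plain = np.linalg.norm(w - left @ core @ right)
left = refit_left_wrapper(w, core, right, block=4)
err_wrapped = np.linalg.norm(w - left @ core @ right)
print(err_plain, err_wrapped)
```

Because the identity is one of the candidates the least-squares refit considers for each block, the reconstruction error after the update cannot exceed that of plain magnitude-based 2:4 pruning in this toy setting. This mirrors, in a much simpler form, the "less than or equal to" flavor of the guarantee stated in the abstract.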
