ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Abstract

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality than conventional 2:4 pruning techniques. The sparse core and block-diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove that this optimization is guaranteed to converge to a solution whose proxy loss is less than or equal to that of state-of-the-art pruning algorithms. Experiments on the Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
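The factorization described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the proxy loss, the block size, or the update rules, so the sketch assumes a plain Frobenius reconstruction error, magnitude-based 2:4 pruning for the sparse core, and a single least-squares update of the left wrapper. All function names (prune_2_4, reconstruct, update_left_wrapper, and so on) are hypothetical. It shows only the shape of the idea: a weight matrix W is approximated as A S B, where S is 2:4 sparse and A, B are block-diagonal, and with identity wrappers the factorization reduces to ordinary 2:4 pruning, so a descent step from that starting point cannot increase the assumed loss.

```python
import numpy as np


def prune_2_4(w):
    """Magnitude-based 2:4 pruning: in every group of 4 consecutive
    entries along a row, zero the 2 entries with the smallest magnitude."""
    rows, cols = w.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)


def block_diag_identity(n, block):
    """Block-diagonal identity, stored as (n // block) dense blocks."""
    return np.stack([np.eye(block) for _ in range(n // block)])


def apply_block_diag(blocks, x):
    """Compute blockdiag(blocks) @ x without building the full matrix."""
    nb, b, _ = blocks.shape
    out = np.einsum("kij,kjc->kic", blocks, x.reshape(nb, b, -1))
    return out.reshape(nb * b, -1)


def reconstruct(a_blocks, s, b_blocks):
    """Return A @ S @ B for block-diagonal A, B and sparse core S."""
    # S @ B is computed as (B^T @ S^T)^T; B^T is the block-wise transpose.
    sb = apply_block_diag(b_blocks.transpose(0, 2, 1), s.T).T
    return apply_block_diag(a_blocks, sb)


def frobenius_loss(w, w_hat):
    """Assumed stand-in for the layer-wise proxy loss (not from the paper)."""
    return float(np.sum((w - w_hat) ** 2))


def update_left_wrapper(w, s, b_blocks, block):
    """One coordinate step: refit each block of A by least squares,
    holding the sparse core S and the right wrapper B fixed."""
    m = apply_block_diag(b_blocks.transpose(0, 2, 1), s.T).T  # S @ B
    nb = w.shape[0] // block
    a_blocks = np.empty((nb, block, block))
    for k in range(nb):
        rows = slice(k * block, (k + 1) * block)
        # Solve A_k @ m[rows] ~= w[rows] in the least-squares sense.
        sol, *_ = np.linalg.lstsq(m[rows].T, w[rows].T, rcond=None)
        a_blocks[k] = sol.T
    return a_blocks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 16))
    s = prune_2_4(w)
    a = block_diag_identity(8, 4)
    b = block_diag_identity(16, 4)
    print("loss with identity wrappers:", frobenius_loss(w, reconstruct(a, s, b)))
    a = update_left_wrapper(w, s, b, 4)
    print("loss after one wrapper update:", frobenius_loss(w, reconstruct(a, s, b)))
```

Storing each wrapper as a stack of small dense blocks keeps its parameter count at n times the block size rather than n squared, which is one way to read the "low-overhead" property claimed for the wrappers. A full block coordinate descent would also update the right wrapper and the sparse core in turn; those steps are omitted here because the abstract does not describe them.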
