Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Abstract

Unstructured pruning produces sparse weight tensors, but standard implementations leave tensor shapes unchanged, so the deployed model is no smaller than it was before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization, with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy that a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. At comparable accuracy, Squeeze-Release makes the deployable network 39x smaller than the unpruned model on a fully connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny). In addition, we prove that the rewrite extends to transformer architectures.
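To make the idea of minimization concrete, the following is a minimal NumPy sketch for the simplest case, a two-layer ReLU MLP whose weights have already been masked to exact zeros. It is an illustration under stated assumptions, not the authors' implementation: the function names `forward` and `minimize_mlp`, the two-layer setup, and the rule for folding constant units into the next bias are all assumptions made here for the example.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """y = W2 @ relu(W1 @ x + b1) + b2 for a single input vector x."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def minimize_mlp(W1, b1, W2, b2):
    """Hypothetical helper: drop hidden units that cannot affect the output.

    A unit whose incoming row in W1 is all zero emits the constant relu(b1),
    so its contribution is folded into the output bias before removal.
    A unit whose outgoing column in W2 is all zero is never read downstream.
    """
    dead_in = ~W1.any(axis=1)    # unit ignores the input entirely
    dead_out = ~W2.any(axis=0)   # unit is ignored by the next layer
    # Fold the constant outputs of input-dead units into the output bias.
    b2_new = b2 + W2[:, dead_in] @ np.maximum(b1[dead_in], 0.0)
    keep = ~(dead_in | dead_out)
    return W1[keep], b1[keep], W2[:, keep], b2_new

# Toy masked network: 4 inputs, 6 hidden units, 3 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4)); b1 = rng.normal(size=6)
W2 = rng.normal(size=(3, 6)); b2 = rng.normal(size=3)
W1[[1, 4]] = 0.0   # two hidden units lose all incoming weights
W2[:, 2] = 0.0     # one hidden unit loses all outgoing weights

small = minimize_mlp(W1, b1, W2, b2)
x = rng.normal(size=4)
y_full = forward(x, W1, b1, W2, b2)
y_small = forward(x, *small)
```

In this toy case three of the six hidden units are removed, and `y_small` agrees with `y_full` up to floating-point rounding. The sketch covers only structurally dead units in a plain MLP; handling residual streams and normalization layers requires the additional machinery the abstract describes.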