Squeeze-Release: Iterative Pruning via Exact Structural Minimization

Abstract

Unstructured pruning produces sparse weight tensors, but standard implementations leave tensor shapes unchanged, so the deployed model is no smaller than it was before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network whose forward function is identical up to floating-point rounding. The Squeeze-Release cycle alternates pruning and minimization, with an intermediate release step that re-enables the exact-zero positions remaining inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy that a single pruning pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. At comparable accuracy, Squeeze-Release yields a deployable network 39x smaller than the unpruned model on a fully connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny). We also prove that the rewrite extends to transformer architectures.
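To make the idea of minimization and release concrete, the following is a minimal sketch for a two-layer ReLU MLP, not the authors' implementation. The function names (`minimize`, `release`), the rule for folding constant units into the next bias, and the fixed noise scale are all assumptions made for illustration; the abstract does not specify how the noise is calibrated, and the sketch does not cover CompensatedLayerNorm.

```python
import random

def matvec(W, x, b):
    # Dense affine map: W @ x + b, with W stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(W1, b1, W2, b2, x):
    # Two-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2.
    hidden = [max(0.0, h) for h in matvec(W1, x, b1)]
    return matvec(W2, hidden, b2)

def minimize(W1, b1, W2, b2):
    # Hypothetical minimization rule: drop hidden units that cannot
    # influence the output and return smaller dense tensors.
    b2 = list(b2)
    keep = []
    for j in range(len(W1)):
        out_dead = all(row[j] == 0.0 for row in W2)
        in_dead = all(w == 0.0 for w in W1[j])
        if out_dead:
            continue  # nothing downstream reads this unit
        if in_dead:
            # The unit emits the constant relu(b1[j]); fold it into b2.
            const = max(0.0, b1[j])
            for i, row in enumerate(W2):
                b2[i] += row[j] * const
            continue
        keep.append(j)
    W1s = [list(W1[j]) for j in keep]
    b1s = [b1[j] for j in keep]
    W2s = [[row[j] for j in keep] for row in W2]
    return W1s, b1s, W2s, b2

def release(W, scale, rng):
    # Assumed release step: exact zeros that survive inside the compacted
    # tensor become small Gaussian noise and are trainable again.
    return [[w if w != 0.0 else rng.gauss(0.0, scale) for w in row] for row in W]

rng = random.Random(0)
n_in, n_hidden, n_out = 4, 6, 3
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [rng.uniform(-1, 1) for _ in range(n_out)]

# Simulate the result of unstructured pruning.
W1[1] = [0.0] * n_in      # unit 1 lost all of its inputs
for row in W2:
    row[4] = 0.0          # unit 4 lost all of its outputs
W1[0][2] = 0.0            # scattered zero that stays inside a kept row

small = minimize(W1, b1, W2, b2)
x = [rng.uniform(-1, 1) for _ in range(n_in)]
y_big = forward(W1, b1, W2, b2, x)
y_small = forward(*small, x)
max_diff = max(abs(a - b) for a, b in zip(y_big, y_small))

zeros_before = sum(w == 0.0 for row in small[0] for w in row)
W1_released = release(small[0], 1e-3, rng)
zeros_after = sum(w == 0.0 for row in W1_released for w in row)
```

In this sketch the hidden width shrinks from 6 to 4 while the outputs of the masked and minimized networks agree up to rounding. The release step then perturbs the function slightly, which is why, in the cycle described above, it is followed by further training and pruning rather than applied at deployment time.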