

FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

August 7, 2023
Authors: Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li
cs.AI

Abstract

Quantization has become a mainstream compression technique for reducing the model size, computational requirements, and energy consumption of modern deep neural networks (DNNs). With the improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results at low model cost. Prior mixed-precision quantization methods have performed either a post-training quantization search, which compromises accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating-point models. We evaluate our floating-point and integer quantization search (FLIQS) on multiple convolutional networks and vision transformer models to discover Pareto-optimal models. Our approach discovers models that improve upon uniform precision, manual mixed precision, and recent integer quantization search methods. With the proposed integer quantization search, we increase the accuracy of ResNet-18 on ImageNet by 1.31 percentage points and ResNet-50 by 0.90 percentage points at equivalent model cost over previous methods. Additionally, for the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98 percentage points compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously search a joint quantization and neural architecture space and improve ImageNet accuracy by 2.69 percentage points at similar model cost on a MobileNetV2 search space.
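To make the mixed-precision idea concrete, below is a minimal sketch of symmetric per-tensor integer fake-quantization applied per layer at different bitwidths, the kind of per-layer precision assignment a quantization search produces. It assumes only NumPy; the function `fake_quantize_int`, the `precision_plan` dictionary, and the layer names are illustrative placeholders, not FLIQS's actual implementation or API.

```python
import numpy as np

def fake_quantize_int(weights: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor integer fake-quantization: round weights onto a
    signed integer grid of the given bitwidth, then rescale back to float."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for INT8, 7 for INT4
    scale = np.max(np.abs(weights)) / qmax      # map the largest magnitude to qmax
    if scale == 0:
        return weights                          # all-zero tensor: nothing to quantize
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale                            # dequantized values used in the forward pass

# Hypothetical per-layer bitwidth assignment of the kind a mixed-precision
# search would select (layer names and bitwidths are illustrative only).
precision_plan = {"conv1": 8, "layer2.conv": 4, "fc": 8}
weights = {name: np.random.randn(64, 64).astype(np.float32) for name in precision_plan}
quantized = {name: fake_quantize_int(w, precision_plan[name]) for name, w in weights.items()}
```

A search such as the one described above chooses the bitwidth (and, in the floating-point case, the format) for each layer so that the resulting accuracy/cost trade-off lies on the Pareto frontier, rather than fixing a single uniform precision for the whole network.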