FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

Abstract

Quantization has become a mainstream compression technique for reducing the model size, computational requirements, and energy consumption of modern deep neural networks (DNNs). Recent hardware offers improved numerical support, including multiple variants of integer and floating-point formats, so mixed-precision quantization has become necessary to achieve high-quality results at low model cost. Prior mixed-precision quantization methods have performed either a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which incurs high memory usage from branching. We therefore propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating-point models. We evaluate our floating-point and integer quantization search (FLIQS) on multiple convolutional networks and vision transformer models to discover Pareto-optimal models. Our approach discovers models that improve upon uniform precision, manual mixed precision, and recent integer quantization search methods. With the proposed integer quantization search, we increase the ImageNet accuracy of ResNet-18 by 1.31 percentage points and of ResNet-50 by 0.90 percentage points over previous methods at equivalent model cost. Additionally, we explore mixed-precision floating-point search for the first time and improve MobileNetV2 by up to 0.98 percentage points compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to search a joint quantization and neural architecture space simultaneously, improving ImageNet accuracy by 2.69 percentage points at similar model cost on a MobileNetV2 search space.
