• Infrared and Laser Engineering
  • Vol. 51, Issue 4, 20220166 (2022)
Yiduo Li, Zibo Guo, Kai Liu, and Xiaoyao Sun
Author Affiliations
  • School of Computer Science and Technology, Xidian University, Xi'an 710071, China
    DOI: 10.3788/IRLA20220166
    Yiduo Li, Zibo Guo, Kai Liu, Xiaoyao Sun. Mixed-precision quantization for neural networks based on error limit (Invited)[J]. Infrared and Laser Engineering, 2022, 51(4): 20220166

    Abstract

    Deep learning algorithms based on convolutional neural networks exhibit excellent performance, but they also bring large amounts of data and computation. The resulting storage and computing overhead has become the biggest obstacle to deploying such algorithms on hardware platforms. Neural network model quantization replaces the high-precision floating-point numbers in the original model with low-precision fixed-point numbers, which can effectively compress the model size, reduce hardware resource overhead, and improve model inference speed while losing little precision. Most existing quantization methods quantize the data of every layer to the same precision, whereas mixed-precision quantization sets a different quantization precision according to the data distribution of each layer, aiming to achieve higher model accuracy under the same compression ratio; however, finding a suitable mixed-precision quantization strategy remains difficult. Therefore, a mixed-precision quantization strategy based on error limitation was proposed. By uniformly and proportionally limiting the scaling factors in each layer of the neural network, the quantization precision of each layer was determined, and a truncation method was used to linearly quantize the weights and activations to low-precision fixed-point numbers. Under the same compression ratio, this method achieved higher accuracy than uniform-precision quantization. The classical convolutional-neural-network object detection algorithm YOLOv5s was used as the benchmark model to evaluate the method: on the COCO and VOC data sets, compared with uniform-precision quantization, the mean average precision (mAP) of the model compressed to 5 bits was improved by 6% and 24.9%, respectively.
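
    The sketch below illustrates the two ingredients the abstract names: truncation-based linear quantization of a tensor to signed fixed-point values via a scaling factor, and the selection of a per-layer bit-width under a common error limit. It is a minimal illustration only; the function names (linear_quantize, choose_bit_width), the candidate bit-widths, and the error-limit value are hypothetical, and the paper's actual scheme of uniformly and proportionally limiting the scaling factors may differ in detail.

    ```python
    import numpy as np

    def linear_quantize(x, num_bits, scale=None):
        """Truncation-based linear quantization to signed fixed point."""
        qmax = 2 ** (num_bits - 1) - 1        # e.g. 15 for 5-bit signed
        qmin = -(2 ** (num_bits - 1))         # e.g. -16
        if scale is None:
            scale = np.abs(x).max() / qmax    # per-tensor scaling factor
        # Round to the integer grid, then truncate (clip) to the range.
        q = np.clip(np.round(x / scale), qmin, qmax).astype(np.int32)
        return q, scale

    def dequantize(q, scale):
        """Map fixed-point values back to floats for error measurement."""
        return q.astype(np.float32) * scale

    def choose_bit_width(x, error_limit, candidates=(4, 5, 6, 7, 8)):
        """Hypothetical per-layer selection: pick the smallest bit-width
        whose max reconstruction error stays within error_limit."""
        for bits in sorted(candidates):
            q, s = linear_quantize(x, bits)
            if np.abs(dequantize(q, s) - x).max() <= error_limit:
                return bits
        return max(candidates)   # fall back to the widest candidate

    # Example: layers with different weight distributions can end up
    # with different precisions under the same error limit.
    rng = np.random.default_rng(0)
    layers = {"conv1": rng.normal(0, 0.1, (32, 32)),
              "conv2": rng.normal(0, 0.5, (32, 32))}
    for name, w in layers.items():
        print(name, "->", choose_bit_width(w, error_limit=0.02), "bits")
    ```

    Because the bit-width falls out of a single error criterion applied to every layer, layers whose data spread is wider naturally receive more bits, which is the intuition behind error-limited mixed-precision assignment.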