Journal of Infrared and Millimeter Waves
Vol. 41, Issue 5, 914 (2022)
Shao-Yi CHEN1,2,3,4, Xin-Yi TANG2,3,4, Jian WANG2,3,4, Jing-Si HUANG1,2,3,4, and Zheng LI2,3,4,*
Author Affiliations
  • 1 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
  • 2 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
  • 3 University of Chinese Academy of Sciences, Beijing 100049, China
  • 4 Key Laboratory of Infrared System Detection and Imaging Technology, Chinese Academy of Sciences, Shanghai 200083, China
DOI: 10.11972/j.issn.1001-9014.2022.05.016
Shao-Yi CHEN, Xin-Yi TANG, Jian WANG, Jing-Si HUANG, Zheng LI. An ultra-efficient streaming-based FPGA accelerator for infrared target detection[J]. Journal of Infrared and Millimeter Waves, 2022, 41(5): 914

    Abstract

Object detection algorithms based on deep learning have achieved great success, performing significantly better than traditional algorithms and even surpassing humans in many scenarios. Unlike RGB cameras, infrared cameras can see objects even in the dark, and can therefore be used in many fields such as surveillance and autonomous driving. In this paper, a lightweight target detection algorithm for embedded devices is proposed, which is accelerated and deployed on a Xilinx Ultrascale+ MPSoC FPGA ZU3EG. The accelerator runs at a 350 MHz clock frequency with a throughput of 551 FPS and a power of only 8.4 W. The intersection over union (IoU) of the algorithm reaches an accuracy of 73.6% on the FLIR dataset. Compared with previous work, the accelerator design improves performance by 2.59× and reduces power consumption by 49.02%.

    Introduction

Infrared systems have the unique advantage of resolving objects in dark environments or bad weather. Nowosielski et al. [1] took advantage of this feature and developed a system to expand human vision, which was used to detect pedestrians while driving at night to improve vehicle safety. Mushahar et al. [2] deployed an infrared system at the entrance of a public place to take contactless temperature measurements while detecting pedestrians, prohibiting people with excessive body temperature from entering.

Although infrared images have many advantages, they lack contrast and edge information compared with visible images. Akula et al. [3] proposed WignerMSER, a new detector based on local features of infrared images, to solve this problem and enhance target detection. Traditional image processing methods are often used in resource-constrained embedded systems because of their low computational cost. To enhance the detection effect, a two-stage image processing method is generally adopted. Wu et al. [4] extracted candidate boxes of pedestrians using an adaptive threshold and a vertical edge operator, and then performed real-time bounding-box detection of far-infrared pedestrians with a morphological method. Piniarski et al. [5] first preprocessed the image with two global thresholds to extract regions of interest and then performed pedestrian segmentation. Ragb et al. [6] also adopted a two-stage algorithm: gradient and texture information were used to obtain local information of the image, and then a superpixel algorithm was used to find the detailed regions without background information.

Deep learning shows more robust performance than traditional algorithms in target detection and is widely used in infrared target detection. Li et al. [7] proposed YOLO-FIRI, based on YOLOv5, to address the long-distance, weak-energy, and low-resolution target detection problems in infrared images. Yun et al. [8] combined the YOLO neural network with Long Short-Term Memory (LSTM) to detect occluded infrared targets. Narayanan et al. [9] first used the YOLO network to extract features and then combined it with a support vector machine (SVM) classifier for classification.

Neural network algorithms have achieved good results in target detection, but embedded deployment remains a huge challenge. Deploying neural network algorithms on embedded devices is usually subject to many constraints, including real-time requirements, limited computing units, and restricted on-chip memory. Huang et al. [10] implemented a YOLO-tiny neural network accelerator, quantizing the network parameters to 2 bits for deployment on low-cost development boards. However, the computational performance of their accelerator only reaches 90.6 GOP/s on the PYNQ-Z2 development board, which is difficult to apply in practical scenarios. To deploy the VGG16+SSD neural network on the PYNQ-Z1 development board with fewer resources, Kang [11] reduced the weights by 87.5% through accelerator-aware pruning. However, achieving such a high pruning rate on all networks is a challenge. In recent years, systolic arrays have been widely used in accelerator designs because of their high throughput, but the potential for designing a much more efficient data path remains to be explored: the performance density of the systolic-array YOLO-tiny accelerator by Li [12] is only 0.165 GOPS/DSP.

FPGAs can generate an accelerator structure tailored to low-precision weights, a feature that gives full play to the advantages of the design, and there have been many studies on FPGA-based deep learning accelerators. Lee et al. [13] proposed a structure that performs two multiply-accumulate (MAC) operations on one DSP. In this design, a subtraction of the left-shifted multiplier follows the double-MAC processing unit; this operation is usually done by look-up tables (LUTs), which become the system bottleneck. Fu et al. explored how to optimize the core deep learning operations, multiplication and addition of 8-bit integers, on DSP slices. However, their design only brought a 1.75× performance improvement, compared with the ideal 2×, due to the bottleneck of DSP bit width. These studies show that existing accelerators do not thoroughly exploit the characteristics of the FPGA architecture.

To solve this problem, we first put forward a lightweight neural network algorithm for infrared target detection. Then, we use high-level synthesis (HLS) to implement the convolutional neural network accelerator and deploy the algorithm on the Ultra96v2 development board. Finally, exploiting the characteristics of the FPGA fabric, we realize 2× MAC operations on a single DSP. Experimental results show that with an input image resolution of 640×360 and an accelerator operating frequency of 350 MHz, the throughput of the accelerator reaches 551 FPS with 8.4 W of power consumption. In short, this paper presents a streaming-based accelerator that effectively deploys an embedded infrared target detection algorithm.

    The main contributions of this paper are summarized as follows:

• This paper implements a pipeline-style infrared target detection accelerator with high throughput using high-level synthesis.

• This paper realizes 2× MACs in a single DSP on the FPGA.

• This paper significantly reduces the power of the accelerator through software-hardware co-optimization.

The remainder of the paper proceeds as follows: Section 1 briefly introduces the target network of the acceleration. Section 2 gives a basic introduction to high-level synthesis for hardware design. Section 3 describes our hardware design in detail and presents the optimization methods of the accelerator. Section 4 gives the experimental results and analysis. Section 5 summarizes our work.

    1 Proposed object detection on thermal images

Convolutional neural network algorithms are widely used in target detection and have achieved good results, but in pursuit of accuracy, current neural network models always come with a large number of layers and parameters. These networks become increasingly unsuitable for deployment on edge devices. Lightweight networks, which aim to reduce the number and complexity of model parameters while maintaining model accuracy, have gradually become a research focus in computer vision.

Net name    ResNet-18  ResNet-34  ResNet-50  VGG-16   SkyNet
Parameters  11.18 M    21.28 M    23.51 M    14.71 M  0.44 M
IoU         0.61       0.26       0.32       0.25     0.73

Table 1. SkyNet parameters and performance comparison with classical networks on the DAC-SDC dataset

As shown in Fig. 1, we set the backbone network of the accelerator to SkyNet [14]. SkyNet is a hardware-friendly, lightweight neural network for target detection and was the champion model of the 2019 DAC-SDC. In this model, depthwise convolution [15] replaces traditional convolution, which significantly reduces the computation and parameters of the model. To detect small targets, a bypass provides more low-level, high-resolution features to improve the detection effect. SkyNet also uses YOLO's detection head and two anchors for bounding-box regression.

Figure 1. The network structure of the infrared target detection algorithm based on deep learning

Networks designed in a top-down flow tend to have more layers and parameters, but more parameters do not necessarily lead to better performance on a particular dataset. Table 1 compares the IoU precision and parameter size of several classical networks and SkyNet on the DAC-SDC dataset. The object detection performance of networks with large parameter counts, such as ResNet and VGG, lags behind SkyNet on this dataset. This shows that a neural network is not simply an algorithm whose quality depends on a large parameter count, and that efficient object detection on embedded devices is possible.

SkyNet adopts the same detection method as YOLO. First, the input images are divided into 20×40 grids, and each grid predicts objects centered on it according to the anchors; to reduce the amount of calculation, the number of anchors is set to two. Anchors with fixed sizes are generally adopted in Faster R-CNN, but this may not suit objects of all different sizes. To improve the training accuracy, a clustering algorithm is used to select anchors according to the dataset.

To simplify the hardware design, we use ReLU instead of ReLU6. The FLIR dataset, which contains 14,452 infrared images of people, bicycles, cars, etc., annotated in MS COCO format, is used for training. We train for 100 epochs on the training set with a batch size of 30 and an initial learning rate of 1e-2, and the IoU reaches 73.6%. After quantizing the network weights to 5 bits and the feature maps to 8 bits, the final IoU is 72.3%. The object detection results on the infrared dataset are shown in Fig. 2: SkyNet performs well on infrared data and can meet the requirements of embedded target detection.

Figure 2. SkyNet object detection results on the FLIR dataset

    2 HLS preliminaries

Before presenting our hardware design, we first review some basic concepts of high-level synthesis (HLS). HLS simplifies the traditional hardware development process, using the C/C++ language to achieve the hardware design and development traditionally completed in RTL.

As shown in Fig. 3, we use the initiation interval and latency to describe the performance of a hardware module. Latency is defined as the number of clock cycles required for a function to compute all of its output values; the smaller the number of clock cycles, the better the hardware performs. Reaching a smaller latency may consume more, sometimes intolerably more, resources. The initiation interval (II) is defined as the number of clock cycles before the function can accept new input data. The smaller the II, the higher the throughput of the hardware, and the more results the circuit produces in the same time. However, this requires a skillfully designed data path to achieve a pipelined circuit structure.
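As a concrete illustration (a minimal Vitis-HLS-style example of our own, not code from the accelerator), the PIPELINE pragma below requests II = 1, so the loop accepts one new element every clock while its total latency is roughly the trip count plus the pipeline depth; without the pragma, each iteration waits for the previous one to finish, and latency grows by the full iteration latency on every trip:

```cpp
// Illustrative only: latency vs. initiation interval in HLS C++.
#include <ap_int.h>

void scale(const ap_int<8> in[256], ap_int<16> out[256], ap_int<8> k) {
    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        // II=1: a new element enters the multiplier every clock cycle.
        out[i] = in[i] * k;
    }
}
```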

Figure 3. Concepts of initiation interval and latency

Take a neural network accelerator as an example. Without a parallel design, the network layers can only run sequentially, and the latency and initiation interval of the whole design are equal. Each forward propagation must wait for the previous calculation to complete before the module can start the next one. While waiting for other stages to finish, multiple computing units sit idle, degrading performance and computational efficiency.

When the pipeline is used for optimization, the circuit takes less time to accept new data and completes more work per clock cycle. Each stage of the pipeline is relatively independent: as long as there is unprocessed data in a stage, the circuit keeps working. A balanced pipeline significantly improves the efficiency of computation, so the processing speed is greatly improved on continuous tasks.

The deep learning accelerator divides the pipeline by layers. Because of the massive difference in workload between layers, a naive pipeline implementation has a vast disparity in the amount of calculation across stages. To solve this problem, we can allocate computing resources in proportion to the amount of calculation in each layer, so that the clock cycles of all pipeline stages are similar. As shown in Fig. 4, a balanced pipeline design makes the circuit computationally more efficient and reduces the idle time of the computing units.

Figure 4. The accelerator design for balancing all stages of the pipeline

    3 Hardware implementation

    3.1 Overview of accelerator architecture

Figure 5. FPGA inference accelerator architecture

The accelerator is divided into two parts: the CPU is in charge of reading image files, and the FPGA realizes the deep learning accelerator with high parallelism. Data is exchanged between the CPU and the FPGA through a DMA module. The deep learning accelerator adopts a pipeline structure composed mainly of three kinds of modules: image pre-processing, convolution, and max-pooling. The pipeline is divided into stages at each convolution or max-pooling operation; each convolutional or max-pooling layer is a computing unit, with FIFOs as the interconnect. The neural network weights are stored in on-chip block RAMs, and all the computing units work in parallel simultaneously to maximize resource utilization.
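A structural sketch of this organization (our own naming, not the authors' code, with trivial placeholder stage bodies standing in for the real conv/pool units) might look as follows in HLS C++: each layer is a function, hls::stream FIFOs form the interconnect, and the DATAFLOW pragma lets all stages run concurrently:

```cpp
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_int<8> act_t;
const int N = 160 * 320;  // elements per frame (illustrative)

static void stage(hls::stream<act_t> &in, hls::stream<act_t> &out) {
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out.write(in.read());  // placeholder for a conv/pool computation
    }
}

void skynet_top(hls::stream<act_t> &img_in, hls::stream<act_t> &det_out) {
#pragma HLS DATAFLOW
    hls::stream<act_t> s1("fifo1"), s2("fifo2");
#pragma HLS STREAM variable=s1 depth=64
#pragma HLS STREAM variable=s2 depth=64
    stage(img_in, s1);   // e.g. a depthwise conv unit
    stage(s1, s2);       // e.g. a pointwise conv unit
    stage(s2, det_out);  // e.g. a max-pooling unit
}
```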

    3.2 Balanced pipeline

The slowest stage determines the throughput of a pipeline-style circuit. An unbalanced pipeline leads to bubbles, preventing the computing units from working at full speed; in other words, unbalanced pipeline stages between compute units indirectly waste on-chip resources. Since the computation amounts of different neural network layers differ significantly, balancing is very important.

After dividing each convolution or max-pooling operation into a pipeline stage, we can approximate the hardware complexity from the numbers of input and output feature map channels and the size of the convolution kernel.

The feature map is the intermediate result of an operator at a convolution or max-pooling layer. We use $W$ to represent the width of the feature map, $H$ its height, $C_{in}$ the number of input channels, $C_{out}$ the number of output channels, and $K$ the size of the convolution kernel. When the stride is equal to 1, the computation amount of the depthwise convolution operation is

$T_{DWC} = H \times W \times K \times K \times C_{in}$ .  (1)

    The computation amount of the pointwise convolution operation is

$T_{PWC} = H \times W \times C_{in} \times C_{out}$ .  (2)
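As a quick sanity check against the K, C, and FM columns of Table 2: layer 1 is a 3×3 depthwise convolution with $C_{in}=3$ on a 160×320 feature map, so Eq. (1) gives $160 \times 320 \times 3 \times 3 \times 3 = 1382400$ MACs; layer 2 is a pointwise convolution from 3 to 48 channels, so Eq. (2) gives $160 \times 320 \times 3 \times 48 = 7372800$ MACs. Both match the #MAC column of the table.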

We use the number of multiply-accumulate (MAC) operations to represent the computation of a convolution layer in the deep learning algorithm, with $\#MAC_n$ denoting the amount of computation of the $n$th layer. The parallelism measures the multiply-accumulate operations completed per clock, with $PF_n$ denoting the parallelism of the $n$th layer. Throughput evaluates the number of operator invocations completed per second, with $T_n$ denoting the throughput of the $n$th layer. Assuming the hardware runs at a clock frequency $f_{clk}$, $T_n$ can be expressed as

$T_n = \frac{PF_n}{\#MAC_n} \times f_{clk}$ .  (3)

This is an upper bound on the reachable throughput of each accelerator operator. $PF_{accel}$ is used to represent the parallelism of the accelerator and is defined as

$PF_{accel} = \max\ PF_i,\quad i \in \{1,2,\cdots,N\}$ .  (4)

Assuming that the accelerator has only one clock domain and using the pipeline-balance condition $T_1 = T_2 = \cdots = T_N$, the theoretical parallelism of the $n$th layer is obtained from

$PF_n = \frac{\#MAC_n}{\max\ \#MAC_i} \times PF_{accel},\quad i \in \{1,2,\cdots,N\}$ .  (5)

When the result is a fraction, it is rounded up to obtain an integer parallelism, which simplifies the design of the accelerator. Rounding up introduces only minor pipeline mismatches; the loss of resource utilization is negligible and has no impact on performance.
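The small host-side program below (our illustration, not the authors' tool) reproduces the PF column of Table 2 from Eq. (5). Note that the computed values are lower bounds; as Section 4.1 also notes, some layers in Table 2 are rounded further up (e.g., layer 3 uses 12 rather than the computed 9) to simplify the design:

```cpp
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <algorithm>
#include <vector>

int main() {
    // #MAC per layer, copied from Table 2.
    std::vector<uint64_t> macs = {
        1382400, 7372800, 5529600, 58982400, 2764800, 58982400, 2764800,
        58982400, 2764800, 157286400, 9216000, 98304000, 768000};
    const double pf_accel = 256.0;  // PF of the heaviest layer (layer 10)
    uint64_t max_mac = *std::max_element(macs.begin(), macs.end());
    for (size_t n = 0; n < macs.size(); n++) {
        // Eq. (5), rounded up to an integer parallelism.
        int pf = (int)std::ceil((double)macs[n] / max_mac * pf_accel);
        printf("layer %2zu: PF >= %d\n", n + 1, pf);
    }
    return 0;
}
```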

3.3 Accelerator design based on high-level synthesis

The datapath must be carefully designed for the hardware to deploy deep learning algorithms effectively on resource-limited embedded devices. To make full use of on-chip resources, all weights are stored on chip, and thanks to the bandwidth of on-chip memory, the weight-access time can be ignored. To minimize the size of the data-caching buffers between neural network layers, our design reuses the feature map as much as possible, and all intermediate data is passed channel-continuously. We introduce the realization of each operator below, assuming the output feature map of the previous layer has size H×W×N_ICH.

    3.3.1 Pointwise convolution datapath design

Fig. 6 shows the design of the pointwise convolution datapath. We divide the feature map into many sliding cubes of size 1×1×N_IN. In each cycle, the computing unit multiplies and accumulates one cube of the feature map with the N_IN weights of each of N_OUT output channels, which is equivalent to setting the parallelism of the accelerator to N_IN×N_OUT. The PW unit takes N_ICH/N_IN clock cycles to compute the results of N_OUT consecutive channels and pass them to the next layer, and N_OCH × H × W / N_OUT computations to get the full result of the PW convolution. The pseudocode for PW convolution is as follows:
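The original pseudocode could not be recovered from the source; the following HLS-style sketch of the datapath just described is our reconstruction, with illustrative sizes modelled on layer 10 of Table 2 (N_IN = N_OUT = 16, so PF = 256) and assumed fixed-point types; weight-array partitioning pragmas are omitted for brevity:

```cpp
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_int<8>  act_t;   // 8-bit activations (Sect. 1)
typedef ap_int<5>  wt_t;    // 5-bit weights
typedef ap_int<24> acc_t;   // accumulator width (our guess)

const int H = 20, W = 40, N_ICH = 384, N_OCH = 512, N_IN = 16, N_OUT = 16;

void pw_conv(hls::stream<act_t> &in, hls::stream<acc_t> &out,
             const wt_t weights[N_OCH][N_ICH]) {
    for (int p = 0; p < H * W; p++) {
        act_t fm[N_ICH];
#pragma HLS ARRAY_PARTITION variable=fm cyclic factor=16
        for (int c = 0; c < N_ICH; c++) {   // buffer one pixel's channels
#pragma HLS PIPELINE II=1
            fm[c] = in.read();
        }
        for (int og = 0; og < N_OCH / N_OUT; og++) {  // N_OUT outputs per group
            acc_t acc[N_OUT];
#pragma HLS ARRAY_PARTITION variable=acc complete
            for (int o = 0; o < N_OUT; o++) acc[o] = 0;
            // One 1x1xN_IN cube per clock: N_IN x N_OUT MACs in parallel.
            for (int ig = 0; ig < N_ICH / N_IN; ig++) {
#pragma HLS PIPELINE II=1
                for (int o = 0; o < N_OUT; o++) {
#pragma HLS UNROLL
                    for (int i = 0; i < N_IN; i++) {
#pragma HLS UNROLL
                        acc[o] += fm[ig * N_IN + i] *
                                  weights[og * N_OUT + o][ig * N_IN + i];
                    }
                }
            }
            for (int o = 0; o < N_OUT; o++) out.write(acc[o]);
        }
    }
}
```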

Figure 6. Datapath of pointwise convolution


    3.3.2 Depthwise convolution datapath design

As shown in Fig. 7 and Fig. 8, the depthwise calculation is divided into two steps. First, a line buffer and a window buffer are used to relieve the memory-access bottleneck, converting the feature map within the 3×3 convolution window on N_IO channels into a data stream of width N_IO×BIT_WIDTH and depth 9.

Figure 7. Datapath of depthwise convolution

Figure 8. Using a line buffer to optimize the datapath

The DW unit fetches weight data from memory with a width of N_IO×WEIGHT_WIDTH and a depth of 9, and then multiplies the feature map with the pre-ordered weights to obtain the results of N_IO channels. The result of the DW convolution is obtained after N_OCH × H × W / N_IO computations. The pseudocode of the DW convolution is as follows:
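As above, the original pseudocode was lost in extraction; the sketch below is our reconstruction of the DW compute unit, consuming the 9-deep window stream produced by the line/window buffers, with an assumed N_IO = 4 channels processed in parallel:

```cpp
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_int<8>  act_t;
typedef ap_int<5>  wt_t;
typedef ap_int<20> acc_t;            // 8b x 5b products summed over 9 taps
const int N_IO = 4;                  // channels in parallel (assumed)

struct win_t { act_t px[N_IO]; };    // one window element, N_IO channels wide

void dw_unit(hls::stream<win_t> &win, const wt_t w[9][N_IO],
             hls::stream<acc_t> &out, int n_windows) {
    for (int n = 0; n < n_windows; n++) {  // N_OCH*H*W/N_IO results in total
        acc_t acc[N_IO] = {0};
#pragma HLS ARRAY_PARTITION variable=acc complete
        for (int k = 0; k < 9; k++) {      // 3x3 kernel, pre-ordered weights
#pragma HLS PIPELINE II=1
            win_t e = win.read();
            for (int c = 0; c < N_IO; c++) {
#pragma HLS UNROLL
                acc[c] += e.px[c] * w[k][c];
            }
        }
        for (int c = 0; c < N_IO; c++) out.write(acc[c]);
    }
}
```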


    3.3.3 Maxpool datapath design

The pipeline structure of max-pooling is shown in Fig. 9. In this design, the stride of max-pooling is 2, and the operation is performed in two main steps, pool1D and pool2D. Pool1D performs maximum pooling in the horizontal direction, comparing every two received data items (each covering N_CH channels) and outputting one. Pool2D performs maximum pooling in the vertical direction: numbering the feature map rows from 0, it first caches the data of an even row and outputs the final result while receiving the following odd row. The max-pooling module works on multiple channels in parallel, and its parallelism depends on the number of the preceding PW layer's output channels.

Figure 9. Datapath of maxpool

    3.4 Double MACs

Table 2 shows that convolution requires a large number of multiplication and addition operations. MACs rely on the FPGA's limited DSP cores, which provide better performance and lower power consumption; although the FPGA can use LUTs to realize multipliers, doing so causes timing issues. To improve system performance with the limited computing resources of embedded devices, we implement a high-performance DSP reuse technique.

Layer   Type   K   C      FM        #MAC        PF
1       DW     3   3      160×320   1382400     3
2       PW     1   48     160×320   7372800     12
3       DW     3   48     80×160    5529600     12
4       PW     1   96     80×160    58982400    96
5       DW     3   96     40×80     2764800     6
6       PW     1   192    40×80     58982400    96
7       DW     3   192    20×40     2764800     3
8       PW     1   384    20×40     58982400    96
9       DW     3   384    20×40     2764800     6
10      PW     1   512    20×40     157286400   256
11      DW     3   1280   20×40     9216000     16
12      PW     1   96     20×40     98304000    160
13      PW     1   10     20×40     768000      2
Total                               465100800   764

Table 2. SkyNet's parallelism factors of each layer

The DSP48E2 slice in Xilinx FPGAs can complete a 27×18-bit multiplication and a 48-bit accumulation. Our network's feature maps and weights are quantized to low bit widths (8 and 5 bits), making it possible for one DSP slice to perform two low-width MACs at a time: a single high-width multiplication takes the place of two low-width multiplications, and the result is split into two numbers.

Fig. 10 shows the internal structure of Xilinx's DSP48E2 slice. Two different weights are input from port A and port D into the pre-adder, which concatenates the two weights. The feature map is input to the multiplier from port B, and the product of the feature map and the spliced data is sent to the accumulator. If there is data to be accumulated, it is input to the accumulator through port C.

Figure 10. DSP48E2 slice architecture

DSP units are used to complete vector MACs, and the calculation result of each unit is transmitted to port C of the next DSP. The vector $I_0$ and the two weight vectors $W_0$ and $W_1$ are aligned through delays and finally input to the PE; $W_1$ needs sign extension and a left shift of 14 bits. The output $Y$ can be expressed as

$Y = (W_1 \times 2^{14} + W_0) \times I_0$ .  (6)

The addition and multiplication are implemented by the pre-adder and the multiplier in the DSP block, respectively, and the cascaded adder implements the accumulation. Reorganizing Eq. (6) gives

$Y = W_1 \times I_0 \times 2^{14} + W_0 \times I_0 = O_1 \times 2^{14} + O_0$ ,  (7)

where $O_0$ and $O_1$ are the dot-product results. Assuming that $O_0$ and $O_1$ have a bit width of 13, $O_0$ is equal to the low part of $Y$, while $O_1$ is affected by the sign bit of $O_0$ and needs to be corrected:

$O_0 = Y[12:0]$ ,  (8)
$O_1 = Y[26:14] + Y[13]$ .  (9)

Only a 13-bit carry chain is needed to implement the result correction; our design requires a small amount of extra logic while doubling the DSP utilization.
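The host-side program below (our illustration) exhaustively checks Eqs. (6)-(9) over all 5-bit weight pairs and 8-bit activations, confirming that the split-and-correct step recovers both products exactly:

```cpp
#include <cstdint>
#include <cstdio>

// Interpret the low 13 bits of v as a signed 13-bit number.
static int64_t sext13(int64_t v) {
    v &= 0x1FFF;
    return (v & 0x1000) ? v - 0x2000 : v;
}

int main() {
    long errors = 0;
    for (int w1 = -16; w1 <= 15; w1++)              // 5-bit weights
        for (int w0 = -16; w0 <= 15; w0++)
            for (int i0 = -128; i0 <= 127; i0++) {  // 8-bit activation
                // Eq. (6): pack the weights 14 bits apart, multiply once.
                int64_t y = ((int64_t)w1 * (1 << 14) + w0) * i0;
                // Eqs. (8)-(9); >> is an arithmetic shift on int64_t here.
                int64_t o0 = sext13(y);
                int64_t o1 = sext13((y >> 14) + ((y >> 13) & 1));
                if (o0 != (int64_t)w0 * i0 || o1 != (int64_t)w1 * i0)
                    errors++;
            }
    printf("mismatches: %ld\n", errors);  // prints 0
    return 0;
}
```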

Fig. 11 shows how the PE array in the pointwise convolution works. The yellow lines represent the feature map passed in from the previous layer, and the green lines are the weight inputs. With 2× MACs, two weights enter a PE simultaneously.

Figure 11. Datapath of the processing element array

    3.5 System optimization

Fig. 12 shows the optimization of the whole system under test. The test image data is stored on an SD card. To relieve the image-reading bottleneck, multithreading is used to reduce the image-reading delay on the Linux operating system, and the libjpeg-turbo library is used to speed up image decoding. A ping-pong buffer reduces the system bottleneck when the accelerator on the FPGA communicates with the processing system.

Figure 12. System optimization

    4 Experiment and analysis

    4.1 Experimental setup

We deploy the deep learning accelerator, based on the SkyNet network, on the Ultra96v2 evaluation board. The board is equipped with a Xilinx Zynq Ultrascale+ MPSoC ZU3EG hybrid chip, composed mainly of a CPU and an FPGA. The logic part of the chip is a 16 nm FPGA with 7.6 Mb of BRAM, 360 DSPs, and 70.6 K LUTs; the CPU part contains a quad-core ARM Cortex-A53. We use the IP driver provided by the PYNQ framework to control the accelerator. In Table 2, K represents the kernel size, C the number of output channels of the layer, FM the size of the feature map, #MAC the amount of computation, and PF the parallel factor. SkyNet consists of six depthwise convolutions and seven pointwise convolutions. The 10th convolution layer has the largest computation, and its parallelism is set to 256; the parallelism of the other layers is set according to Eq. (5).

Part of the parallel factors are rounded up to simplify the design of the accelerator; the specific settings are shown in Table 2. Because the original floating-point weights are quantized to 5 bits and the activations to 8 bits, the weight size of the entire network is reduced to 277 KB, making it possible to store all the weight data on chip. The intersection over union (IoU) of the original network reaches an accuracy of 73.6% on the same dataset, while that of the quantized model drops by 1.3% to 72.3%. Our accelerator consumes 206.5 block RAMs (BRAM), 360 DSPs, 50,518 look-up tables (LUT), and 40,488 flip-flops (FF). The design has no timing issues and runs at a 350 MHz clock frequency.

    4.2 Accelerator throughput analysis

SkyNet's forward propagation requires 465 M MAC operations, and our accelerator is designed to complete 764 MACs per clock cycle. When the accelerator runs at a 350 MHz clock frequency, the theoretical throughput is 764 × 350 MHz = 267.4 GMAC/s. The actual test results in Table 3 show a frame rate of 551 FPS, i.e., a throughput of 256.2 GMAC/s, so the computational efficiency of the accelerator is 256.2 / 267.4 = 95.8%. Such high computational efficiency mainly benefits from the balanced pipeline structure and the ultra-high bandwidth of on-chip memory.

                  iSmart    BJUT Runner   SkrSkr   Our work
Model             SkyNet    UltraNet      SkyNet   SkyNet
# of MACs         465 M     272 M         465 M    465 M
# of PFs          256       448           512      764
Frequency (MHz)   220       166           333      350
BRAMs             209       150.5         209      206.5
DSPs              329       360           329      360
LUTs              53809     44633         52875    50518
FFs               55833     58813         55278    40488
Precision (W,A)   11,9      4,4           6,8      5,8
IoU               73.1%     65.6%         73.1%    72.3%
Throughput (FPS)  25        213           52       551
Power (W)         13.5      6.66          6.7      8.4
Energy (mJ/img)   540       31            128      15.2

Table 3. Comparison with DAC-SDC accelerator designs

    4.3 Comparison with other works

Comparing our design in Table 3 with those of previous winners of the design automation conference system design contest (DAC-SDC) shows the superiority of the accelerator design in this paper; the experimental data are from the official DAC-SDC website. BJUT Runner is the champion of DAC-SDC 2020. They also adopted a pipeline-style accelerator design, but the unreasonable pipeline configuration of their accelerator prevented it from giving full play to the advantages of the pipeline: their computational efficiency was only 77.8%, with a theoretical throughput of 448 × 166 MHz = 74.4 GMAC/s against an actual throughput of 57.9 GMAC/s. Our work is 2.59× better in performance and 2.04× better in energy efficiency than theirs.

SkrSkr took second place in DAC-SDC 2020. Their design used a convolution computing engine with very high parallelism that calculates one fixed-size convolution at a time; the neural network layers were calculated sequentially, and the intermediate results and the network weights were passed repeatedly between on-chip memory and DDR, degrading performance. Our design achieves 10.6× the throughput and an 8.4× energy-efficiency improvement over SkrSkr.

    5 Conclusion

This paper presents a low-power convolutional neural network accelerator for infrared target detection on embedded devices, together with a design method that minimizes hardware resource consumption for this hardware structure. The accelerator achieves 256.2 GMAC/s throughput on the Xilinx ZU3EG device and the best computational efficiency and energy efficiency compared with previous works. This low-power deep learning FPGA accelerator has high practical value.

    References

    [1] A Nowosielski, K Małecki, P Forczmański et al. Embedded Night-Vision System for Pedestrian Detection. IEEE Sensors Journal, 20, 9293-9304(2020).

[2] M F A Mushahar, N Zaini. 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 222-227 (2021).

    [3] A Akula, R Ghosh, S Kumar et al. WignerMSER: Pseudo-Wigner Distribution Enriched MSER Feature Detector for Object Recognition in Thermal Infrared Images. IEEE Sensors Journal, 19, 4221-4228(2019).

    [4] D Wu, J Wang, W Liu et al. 2017 First International Conference on Electronics Instrumentation & Information Systems (EIIS), 1-4(2017).

[5] K Piniarski, P Pawłowski. 2017 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), 160-165 (2017).

    [6] H K Ragb, T H Aspiras, V K Asari. 2018 IEEE International Symposium on Technologies for Homeland Security (HST), 1-7(2018).

    [7] S Li, Y Li, Y Li et al. YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection. IEEE Access, 9, 141861-141875(2021).

[8] S Yun, S Kim. 2019 19th International Conference on Control, Automation and Systems (ICCAS), 94-96 (2019).

[9] A Narayanan, R D Kumar, R RoselinKiruba et al. 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 431-434 (2021).

    [10] J Huang, J Yang, S Nui et al. 2021 China Semiconductor Technology International Conference (CSTIC), 1-3(2021).

    [11] H J Kang. 2019 International Conference on Field-Programmable Technology (ICFPT), 419-422(2019).

    [12] Y Li, S Lu, J Luo et al. 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), 335-339(2019).

    [13] S Lee, D Kim, D Nguyen et al. Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38, 888-897(2019).

    [14] X Zhang, H Lu, C Hao et al. SkyNet: a hardware-efficient method for object detection and tracking on embedded systems. Proceedings of Machine Learning and Systems, 2, 216-229(2020).

[15] A G Howard, M Zhu, B Chen et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
