
- Journal of Infrared and Millimeter Waves
- Vol. 41, Issue 5, 914 (2022)
Introduction
Infrared systems have the unique advantage of resolving objects in dark environments or bad weather. Nowosielski et al. [
Although infrared images have many advantages, they lack the contrast and edge information of visible images. Akula et al. [
Deep learning shows more robust performance than traditional algorithms in target detection and is widely used for infrared target detection. Li et al. [
Neural network algorithms have achieved good results in target detection, but embedded deployment remains a major challenge. Deploying neural networks on embedded devices is subject to many constraints, including real-time requirements, limited computing units, and restricted on-chip memory. Huang et al. [
An FPGA can instantiate an accelerator structure tailored to low-precision weights, which allows the design to exploit this property fully. There have been many studies on FPGA-based deep learning accelerators. Lee et al. [
To solve this problem, we first propose a lightweight neural network algorithm for infrared target detection. Then, we use high-level synthesis (HLS) to implement the convolutional neural network accelerator and deploy the algorithm on the Ultra96v2 development board. Finally, exploiting the characteristics of the FPGA architecture, we perform two MAC operations on a single DSP. Experimental results show that, with an input image resolution of 640×360 and an accelerator operating frequency of 350 MHz, the accelerator reaches a throughput of 551 FPS at 8.4 W power consumption. In summary, this paper designs a streaming-based accelerator that effectively deploys an embedded infrared target detection algorithm.
The main contributions of this paper are summarized as follows:
• This paper implements a pipeline-style infrared target detection accelerator with high throughput using high-level synthesis.
• This paper realizes two MAC operations on a single DSP slice on the FPGA.
• This paper significantly reduces the power consumption of the accelerator through software and hardware co-optimization.
The remainder of the paper proceeds as follows: Section 1 briefly introduces the target network to be accelerated. Section 2 gives a basic introduction to the concept of high-level synthesis for hardware design. Section 3 describes our hardware design in detail and presents the optimization methods of the accelerator. Section 4 presents the experimental results and analysis. Section 5 concludes the paper.
1 Proposed object detection on thermal images
Although convolutional neural networks are widely used in target detection and have achieved good results, current models pursue accuracy with ever more layers and parameters, which makes them increasingly unsuitable for deployment on edge devices. Lightweight networks, which aim to reduce the number and complexity of model parameters while maintaining accuracy, have gradually become a research focus in computer vision.
| Net name | ResNet-18 | ResNet-34 | ResNet-50 | VGG-16 | SkyNet |
|---|---|---|---|---|---|
| Parameters | 11.18 M | 21.28 M | 23.51 M | 14.71 M | 0.44 M |
| IoU | 0.61 | 0.26 | 0.32 | 0.25 | 0.73 |

Table 1. Parameters and performance of SkyNet compared with classical networks on the DAC-SDC dataset
The network structure of the deep-learning-based infrared target detection algorithm is shown in Fig. 1.
Figure 1. The network structure of the infrared target detection algorithm based on deep learning
Networks designed in a top-down flow tend to have more layers and parameters, but more parameters do not necessarily lead to better performance on a particular dataset.
SkyNet adopts the same detection method as YOLO. The input image is divided into a 20×40 grid, and each grid cell predicts objects centered in that cell according to the anchors. To reduce the amount of computation, this paper sets the number of anchors to two. Anchors of fixed sizes, as generally adopted in Faster R-CNN, may not suit objects of all sizes, so a clustering algorithm is used to select anchors according to the dataset and improve training accuracy.
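The clustering procedure is not detailed here; a common choice (following YOLOv2) is k-means over the widths and heights of the ground-truth boxes. The listing below is a minimal sketch under that assumption; k = 2, plain Euclidean distance, and the example box list are illustrative rather than taken from the paper.

```cpp
#include <cstdio>
#include <vector>

// Minimal k-means over (width, height) of ground-truth boxes to pick anchors.
// Assumes k = 2 anchors and Euclidean distance; a (1 - IoU) distance is also common.
struct Box { double w, h; };

std::vector<Box> kmeans_anchors(const std::vector<Box>& boxes, int k, int iters) {
    std::vector<Box> centers(boxes.begin(), boxes.begin() + k);  // naive initialization
    std::vector<int> assign(boxes.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: nearest center in (w, h) space.
        for (size_t i = 0; i < boxes.size(); ++i) {
            double best = 1e30;
            for (int c = 0; c < k; ++c) {
                double dw = boxes[i].w - centers[c].w, dh = boxes[i].h - centers[c].h;
                double d = dw * dw + dh * dh;
                if (d < best) { best = d; assign[i] = c; }
            }
        }
        // Update step: mean of the boxes assigned to each center.
        std::vector<Box> sum(k, Box{0, 0});
        std::vector<int> cnt(k, 0);
        for (size_t i = 0; i < boxes.size(); ++i) {
            sum[assign[i]].w += boxes[i].w;
            sum[assign[i]].h += boxes[i].h;
            ++cnt[assign[i]];
        }
        for (int c = 0; c < k; ++c)
            if (cnt[c]) centers[c] = Box{sum[c].w / cnt[c], sum[c].h / cnt[c]};
    }
    return centers;
}

int main() {
    std::vector<Box> boxes = {{12, 30}, {14, 34}, {60, 40}, {70, 44}, {11, 28}, {65, 42}};
    auto anchors = kmeans_anchors(boxes, 2, 20);
    for (const auto& a : anchors) printf("anchor: %.1f x %.1f\n", a.w, a.h);
    return 0;
}
```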
To simplify the hardware design, we use ReLU instead of ReLU6. The FLIR dataset is used for training; it contains 14,452 infrared images, including people, bicycles, cars, etc., annotated in the MS COCO format. We train for 100 epochs on the training set with a batch size of 30. The initial learning rate is 1e-2, and the IoU reaches 73.6%. We quantize the network weights to 5 bits and the feature maps to 8 bits, and the final IoU is 72.3%. The object detection results on the infrared dataset are shown in Fig. 2.
Figure 2. SkyNet object detection results on the FLIR dataset
2 HLS preliminaries
Before presenting our hardware design, we first review some basic concepts of high-level synthesis (HLS). HLS simplifies the traditional hardware development process: designs that would otherwise be written in RTL are described in C/C++.
The concepts of initiation interval and latency are illustrated in Fig. 3.
Figure 3. Concepts of initiation interval and latency
Take a neural network accelerator as an example. Without a parallel design, the network layers can only run sequentially, and the latency and initiation interval of the whole design are equal. Each forward propagation must wait for the previous computation to finish before the module can start the next one. While waiting for other stages to finish, many computing units sit idle, degrading performance and computational efficiency.
When pipelining is used for optimization, the circuit takes less time to accept new data and completes more work per clock cycle. Each pipeline stage is relatively independent: as long as a stage has unprocessed data, the circuit keeps working. A balanced pipeline significantly improves computational efficiency, so the processing speed is greatly improved when handling continuous tasks.
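As a small illustration (not code from the paper) of the difference between latency and initiation interval in HLS, a loop can be pipelined so that it accepts one new input every clock cycle even though each individual result still takes several cycles to produce:

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Toy HLS kernel: multiply-accumulate over a stream. With the PIPELINE pragma the
// loop starts one new MAC per clock (II = 1), while the latency of each individual
// result spans several cycles. Bit widths and names are illustrative.
void mac_stream(hls::stream<ap_int<8> > &in, ap_int<5> w, ap_int<32> &acc, int n) {
    ap_int<32> sum = 0;
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        sum += in.read() * w;   // one new element accepted per cycle once the pipeline fills
    }
    acc = sum;
}
```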
The deep learning accelerator divides the pipeline by layers. Because the workload differs massively between layers, a naive pipeline implementation has stages with vastly different amounts of computation. To solve this problem, we allocate computing resources to each layer according to its amount of computation, so that the clock cycles spent in each pipeline stage are similar, as shown in Fig. 4.
Figure 4. The accelerator design for balancing all stages of the pipeline
3 Hardware implementation
3.1 Overview of accelerator architecture
Figure 5. FPGA inference accelerator architecture
The accelerator is divided into two parts: the CPU is in charge of reading image files, and the FPGA realizes the deep learning accelerator with high parallelism. Data is exchanged between the CPU and the FPGA through a DMA module. The deep learning accelerator adopts a pipeline structure composed mainly of three kinds of modules: image pre-processing, convolution, and max-pooling. The pipeline is divided into stages, one per convolution or max-pooling operation. Each convolutional or max-pooling layer is a computing unit, with FIFOs as the interconnection. The neural network weights are stored in on-chip block RAMs, and all computing units work in parallel simultaneously to maximize resource utilization.
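A minimal sketch of how such a streaming top level can be expressed in HLS (illustrative, not the paper's source code): the stage functions run concurrently under the DATAFLOW pragma and communicate through FIFO streams, and the stage bodies here are placeholders.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_int<8> act_t;                   // 8-bit activations, as in the paper
typedef hls::stream<act_t> fifo_t;         // FIFO interconnect between stages
const int N = 160 * 320;                   // illustrative tile size

// Placeholder stage bodies: each unit consumes and produces a stream so that
// all stages run concurrently once the DATAFLOW region is entered.
static void stage_a(fifo_t &in, fifo_t &out) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read());              // real design: image pre-processing
    }
}
static void stage_b(fifo_t &in, fifo_t &out) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() * 2);          // real design: convolution / max-pooling
    }
}

void accelerator_top(fifo_t &img_in, fifo_t &result_out) {
#pragma HLS DATAFLOW                        // stages execute as a task-level pipeline
    fifo_t s0;
#pragma HLS STREAM variable=s0 depth=64
    stage_a(img_in, s0);
    stage_b(s0, result_out);
}
```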
3.2 Balanced pipeline
The slowest stage determines the throughput of a pipeline-style circuit. An unbalanced pipeline introduces bubbles, which prevent the computing units from working at full speed; in other words, unbalanced pipeline stages between different computing units indirectly waste on-chip resources. The amounts of computation of different neural network layers differ significantly, which makes balancing the pipeline very important.
After assigning each convolution or max-pooling operation to a pipeline stage, we can estimate the hardware complexity from the numbers of input and output feature map channels and the size of the convolution kernel.
A feature map is the intermediate result of an operator at a convolution or max-pooling layer. We use W to represent the width of the feature map, H its height, Cin the number of input channels, Cout the number of output channels, and K the size of the convolution kernel. When the stride is equal to 1, the computation amount of the depthwise convolution operation is #MACdw = K × K × Cin × H × W.
The computation amount of the pointwise convolution operation is #MACpw = Cin × Cout × H × W.
We use the number of multiply-accumulate operations (MACs) to represent the computation of a convolution layer, and #MACn denotes the amount of computation of the nth layer. The parallelism measures how many multiply-accumulate operations can be completed per clock cycle, and PFn represents the parallelism of the nth layer. Throughput evaluates the number of operations that can be completed per second, and Tn denotes the throughput of the nth layer. Assuming the hardware runs at clock frequency fclk, Tn can be expressed as Tn = fclk × PFn.
This is an upper bound on the reachable throughput of each accelerator operator. PFaccel represents the parallelism of the whole accelerator and is defined as the sum of the parallelism of all layers: PFaccel = Σn PFn.
Assuming that the accelerator has only one clock domain and using the pipeline balance condition #MAC1 / PF1 = #MAC2 / PF2 = … = #MACN / PFN, the parallelism of each layer follows as PFn = PFaccel × #MACn / Σm #MACm.
When the result is a fraction, it is rounded up to obtain an integer parallelism, which simplifies the accelerator design. Rounding up introduces only minor pipeline mismatches, and the resulting loss of resource utilization is negligible, with no impact on performance.
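As a worked illustration using the values in Table 2: the heaviest layer (layer 10, #MAC = 157,286,400) is assigned PF = 256, so it needs 157,286,400 / 256 = 614,400 cycles per frame. At fclk = 350 MHz this stage alone bounds the frame rate to 350×10^6 / 614,400 ≈ 569 FPS, which is consistent with the 551 FPS measured result reported in Table 3.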
3.3 Accelerator design based on high level synthesis
The datapath must be carefully designed for the hardware to deploy deep learning algorithms effectively on resource-limited embedded devices. To make full use of on-chip resources, all weights are stored on chip; thanks to the bandwidth of on-chip memory, the weight access time can be ignored. To minimize the size of the data buffers between neural network layers, our design reuses the feature map as much as possible, and all intermediate data is passed in channel-continuous order. We introduce the realization of each operator below, assuming that the output feature map of the previous layer has size W × H with Cin channels.
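One way to make the inter-layer data channel-continuous in HLS is to pack several channels of the 8-bit feature map into one stream word. The sketch below is illustrative: the packing factor N_IO follows the naming used later in the text, but its value and the helper itself are assumptions, not released source code.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

const int N_IO = 8;                              // channels packed per stream word (assumed)
typedef ap_int<8> act_t;                         // 8-bit activations
typedef ap_uint<N_IO * 8> word_t;                // one word carries N_IO channels of one pixel
typedef hls::stream<word_t> fm_stream_t;         // FIFO type between layers

// Pack N_IO channel values of one pixel into a single stream word (channel-continuous order).
word_t pack_channels(const act_t ch[N_IO]) {
#pragma HLS INLINE
    word_t w = 0;
    for (int c = 0; c < N_IO; ++c) {
#pragma HLS UNROLL
        w.range(8 * c + 7, 8 * c) = ch[c];       // channel c occupies one byte lane
    }
    return w;
}
```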
3.3.1 Pointwise convolution datapath design
Figure 6. Datapath of pointwise convolution
| Layer | Type | K | C | FM | #MAC | PF |
|---|---|---|---|---|---|---|
| 1 | DW | 3 | 3 | 160×320 | 1382400 | 3 |
| 2 | PW | 1 | 48 | 160×320 | 7372800 | 12 |
| 3 | DW | 3 | 48 | 80×160 | 5529600 | 12 |
| 4 | PW | 1 | 96 | 80×160 | 58982400 | 96 |
| 5 | DW | 3 | 96 | 40×80 | 2764800 | 6 |
| 6 | PW | 1 | 192 | 40×80 | 58982400 | 96 |
| 7 | DW | 3 | 192 | 20×40 | 2764800 | 3 |
| 8 | PW | 1 | 384 | 20×40 | 58982400 | 96 |
| 9 | DW | 3 | 384 | 20×40 | 2764800 | 6 |
| 10 | PW | 1 | 512 | 20×40 | 157286400 | 256 |
| 11 | DW | 3 | 1280 | 20×40 | 9216000 | 16 |
| 12 | PW | 1 | 96 | 20×40 | 98304000 | 160 |
| 13 | PW | 1 | 10 | 20×40 | 768000 | 2 |
| Total | | | | | 465100800 | 764 |

Table 2. SkyNet's parallelism factors for each layer
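Since the pointwise datapath is given here only as Fig. 6, the following is a minimal HLS-style sketch of a pointwise (1×1) convolution loop consistent with that description. The unroll factor N_IO, the flattened weight layout, and the function itself are assumptions for illustration.

```cpp
#include <ap_int.h>

const int N_IO = 8;                      // input channels processed in parallel (assumed)

typedef ap_int<8>  act_t;                // 8-bit activations
typedef ap_int<5>  wgt_t;                // 5-bit weights
typedef ap_int<32> acc_t;

// Pointwise (1x1) convolution for one pixel: each output channel is a dot product
// over the input channels; weights stay in on-chip BRAM (flattened as w[oc*c_in + ic]).
void pw_conv_pixel(const act_t in[], const wgt_t w[], acc_t out[], int c_in, int c_out) {
    for (int oc = 0; oc < c_out; ++oc) {
        acc_t acc = 0;
        for (int ic = 0; ic < c_in; ic += N_IO) {
#pragma HLS PIPELINE II=1
            for (int p = 0; p < N_IO; ++p) {
#pragma HLS UNROLL                        // N_IO MACs per cycle
                acc += in[ic + p] * w[oc * c_in + ic + p];
            }
        }
        out[oc] = acc;
    }
}
```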
3.3.2 Depthwise convolution datapath design
The datapath of the depthwise convolution is shown in Fig. 7.
Figure 7. Datapath of depthwise convolution
Figure 8. Using a line buffer to optimize the datapath
The DW unit reads one weight group from a memory with data width N_IO×WEIGHT_WIDTH and depth 9, and then multiplies the feature map with the pre-ordered weights to obtain results for N_IO channels at a time. The full result of the DW convolution is obtained after N_OCH × H × W / N_IO such computations. A sketch of the DW convolution loop follows.
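This is a minimal HLS-style sketch consistent with the description above (the line buffer of Fig. 8 feeding 3×3 windows, N_IO channels per step); the loop structure and all names beyond N_IO, WEIGHT_WIDTH, and N_OCH are assumptions rather than the authors' exact code.

```cpp
#include <ap_int.h>

const int N_IO = 8;                       // channels processed in parallel (from the text)
const int K = 3;                          // 3x3 depthwise kernel

typedef ap_int<8>  act_t;                 // 8-bit activations
typedef ap_int<5>  wgt_t;                 // WEIGHT_WIDTH = 5 bits
typedef ap_int<32> acc_t;

// Depthwise 3x3 convolution for one output pixel of N_IO channels.
// 'win' holds the 3x3 window produced by the line buffer (Fig. 8); 'w' holds the
// nine pre-ordered weight words, each carrying N_IO channels.
void dw_conv_pixel(const act_t win[K * K][N_IO], const wgt_t w[K * K][N_IO],
                   acc_t out[N_IO]) {
#pragma HLS PIPELINE II=1
    for (int c = 0; c < N_IO; ++c) {
#pragma HLS UNROLL
        acc_t acc = 0;
        for (int t = 0; t < K * K; ++t) {
#pragma HLS UNROLL
            acc += win[t][c] * w[t][c];   // each channel uses its own 3x3 kernel
        }
        out[c] = acc;
    }
}
// The full layer repeats this N_OCH * H * W / N_IO times while the line buffer
// slides over the input feature map.
```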
3.3.3 Maxpool datapath design
The design of the pipelined max-pooling structure is shown in Fig. 9.
Figure 9. Datapath of maxpool
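A 2×2 max-pooling stage can be streamed in the same style with a one-row buffer. The sketch below is illustrative: stride 2, the row-buffer scheme, and the buffer size are assumptions, since only the datapath figure is given here.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_int<8> act_t;

// Streaming 2x2 max-pooling with stride 2: buffer one row of column-wise maxima,
// then combine it with the next row to emit one output per 2x2 window.
void maxpool2x2(hls::stream<act_t> &in, hls::stream<act_t> &out, int width, int height) {
    static act_t row_max[1024];                   // column maxima of the even row (>= width/2)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; x += 2) {
#pragma HLS PIPELINE II=2
            act_t a = in.read();
            act_t b = in.read();
            act_t m = (a > b) ? a : b;            // horizontal max of the pair
            if ((y & 1) == 0) {
                row_max[x >> 1] = m;              // even row: remember the partial max
            } else {
                act_t p = row_max[x >> 1];
                out.write((p > m) ? p : m);       // odd row: emit the 2x2 maximum
            }
        }
    }
}
```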
3.4 Double MACs
The DSP48E2 slice in Xilinx FPGAs can complete a 27×18-bit multiplication and a 48-bit accumulation. Our network's feature maps and weights are quantized to low bit widths (8 and 5 bits), which makes it possible for one DSP slice to perform two low-width MACs in a single stage: we use one high-width multiplication in place of two low-width multiplications and split the result into two numbers.
Figure 10. DSP48E2 slice architecture
DSP units are used to complete the vector MACs, and the calculation result of each unit is transmitted to port C of the next DSP. The vector I0 and the two weight vectors W0 and W1 are aligned through delays and finally input to the PE. W1 needs sign extension and left shifting. The output Y can be expressed as Y = (W1 × 2^13 + W0) × I0 + C. (6)
The addition and the multiplication are implemented by the pre-adder and the multiplier in the DSP block, respectively, and the cascaded adder implements the accumulation. Reorganizing (6) gives Y = W1 × I0 × 2^13 + W0 × I0 + C. (7)
O0 and O1 are the dot-product results. Assuming that O0 and O1 each have a bit width of 13, O0 is equal to the low 13 bits of Y, while O1 is affected by the sign bit of O0 and needs to be corrected: O1 = Y[25:13] + O0[12].
Only a 13-bit carry chain is needed to implement this correction, so our design requires only a small amount of extra logic while doubling the DSP utilization.
Figure 11. Datapath of the processing element array
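The split-and-correct step can be checked in plain C++. The short program below (illustrative, not the accelerator source) packs two signed 5-bit weights with one signed 8-bit activation into a single wide multiplication and verifies that the two 13-bit lanes, plus the sign correction, recover both products exactly.

```cpp
#include <cstdint>
#include <cstdio>

// Software check of the double-MAC split: P = W1*2^13 + W0, Y = P * A.
// The low 13-bit lane holds W0*A; the high lane holds W1*A after adding the
// sign bit of the low lane (the short carry chain mentioned in the text).
int main() {
    const int LANE = 13;                           // product lane width (5-bit x 8-bit fits)
    const int MASK = (1 << LANE) - 1;
    for (int w0 = -16; w0 <= 15; ++w0)             // 5-bit signed weights
        for (int w1 = -16; w1 <= 15; ++w1)
            for (int a = -128; a <= 127; ++a) {    // 8-bit signed activation
                // One wide multiplication in place of two narrow ones.
                int64_t y = ((int64_t)w1 * (1 << LANE) + w0) * a;
                // Low lane: bits [12:0] of Y, re-interpreted as a signed 13-bit value.
                int32_t o0 = (int32_t)(y & MASK);
                if (o0 & (1 << (LANE - 1))) o0 -= (1 << LANE);
                // High lane: bits [25:13] of Y plus the sign bit of the low lane.
                int32_t o1 = (int32_t)(((uint64_t)y >> LANE) & MASK);
                if (o1 & (1 << (LANE - 1))) o1 -= (1 << LANE);
                o1 += (o0 < 0) ? 1 : 0;
                if (o0 != w0 * a || o1 != w1 * a) {
                    printf("mismatch: w0=%d w1=%d a=%d\n", w0, w1, a);
                    return 1;
                }
            }
    printf("split + correction verified for all 5-bit x 8-bit products\n");
    return 0;
}
```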
3.5 System optimization
Figure 12. System optimization
4 Experiment and analysis
4.1 Experimental setup
We deploy the deep learning accelerator, based on the SkyNet network, on the Ultra96v2 evaluation board. The board is equipped with a Xilinx Zynq UltraScale+ MPSoC ZU3EG hybrid chip composed mainly of a CPU and an FPGA. The logic part of the chip is a 16 nm FPGA with 7.6 Mb of BRAM, 360 DSPs, and 70.6 K LUTs, and the CPU part contains a quad-core ARM Cortex-A53. The accelerator is controlled through the IP driver provided by the PYNQ framework.
Some of the parallelism factors are rounded up to simplify the design of the accelerator. The specific settings are shown in Table 2.
4.2 Accelerator throughput analysis
The forward propagation of SkyNet requires 465 MMAC operations, and our accelerator is designed to perform 764 multiply-accumulate operations per clock cycle. When the accelerator runs at a 350 MHz clock frequency, the theoretical throughput is 764 × 350 MHz = 267.4 GMACs. The actual test results of the accelerator are reported in Table 3.
| | iSmart | BJUT Runner | SkrSkr | Our work |
|---|---|---|---|---|
| Model | SkyNet | UltraNet | SkyNet | SkyNet |
| # of MACs | 465 M | 272 M | 465 M | 465 M |
| # of PFs | 256 | 448 | 512 | 764 |
| Frequency (MHz) | 220 | 166 | 333 | 350 |
| BRAMs | 209 | 150.5 | 209 | 206.5 |
| DSPs | 329 | 360 | 329 | 360 |
| LUTs | 53809 | 44633 | 52875 | 50518 |
| FFs | 55833 | 58813 | 55278 | 40488 |
| Precision (W, A) | 11, 9 | 4, 4 | 6, 8 | 5, 8 |
| IoU | 73.1% | 65.6% | 73.1% | 72.3% |
| Throughput (FPS) | 25 | 213 | 52 | 551 |
| Power (W) | 13.5 | 6.66 | 6.7 | 8.4 |
| Energy (mJ/img) | 540 | 31 | 128 | 15.2 |

Table 3. Comparison with DAC-SDC accelerator designs
4.3 Comparison with other works
Table 3 compares our design with the designs of previous winners of the Design Automation Conference System Design Contest (DAC-SDC).
SkrSkr took second place in DAC-SDC 2020. Its design uses a convolution computing engine with very high parallelism on the FPGA to compute one fixed-size convolution at a time. The neural network layers are calculated sequentially, and the intermediate results and network weights are transferred repeatedly between on-chip memory and DDR, which degrades performance. Our design achieves a 10.6× improvement in throughput and an 8.4× improvement in energy per image over SkrSkr.
5 Conclusion
This paper presents a low-power convolutional neural network accelerator for infrared target detection on embedded devices, together with a design method that minimizes hardware resource consumption for this hardware structure. The accelerator achieves a throughput of 249.7 GMACs on the Xilinx ZU3EG device and attains the best computational efficiency and energy efficiency compared with previous works. This low-power deep learning FPGA accelerator has high practical value.
References
[1] A. Nowosielski, K. Małecki, P. Forczmański, et al. Embedded night-vision system for pedestrian detection. IEEE Sensors Journal, 20, 9293-9304 (2020).
[12] Y. Li, S. Lu, J. Luo, et al. 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), 335-339 (2019).
[14] X. Zhang, H. Lu, C. Hao, et al. SkyNet: a hardware-efficient method for object detection and tracking on embedded systems. Proceedings of Machine Learning and Systems, 2, 216-229 (2020).
[15] A. G. Howard, M. Zhu, B. Chen, et al. MobileNets: efficient convolutional neural networks for mobile vision applications (2017).
