Shao-Yi CHEN1,2,3,4, Xin-Yi TANG2,3,4, Jian WANG2,3,4, Jing-Si HUANG1,2,3,4, and Zheng LI2,3,4,*
Author Affiliations
1 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
2 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
4 Key Laboratory of Infrared System Detection and Imaging Technology, Chinese Academy of Sciences, Shanghai 200083, China
Fig. 1. Network structure of the deep-learning-based infrared target detection algorithm
Fig. 2. SkyNet object detection results on the FLIR dataset
Fig. 3. Concepts of initial interval and latency
Fig. 4. Accelerator design for balancing all pipeline stages
Fig. 5. FPGA inference accelerator architecture
Fig. 6. Datapath of pointwise convolution
Fig. 7. Datapath of depthwise convolution
Fig. 8. Using line buffer to optimize datapath
Fig. 9. Datapath of maxpool
Fig. 10. DSP48E2 slice architecture
Fig. 11. Datapath of process element array
Fig. 12. System optimization
Algorithm 1. Pseudocode for the pointwise convolution layer

Input:
    in<N_IN × BIT_IN>: feature map input
    weight<N_OUT × N_IN × BIT_WT>[N_OCH / N_OUT][N_ICH / N_IN]: weights of the neural network
    N_IN: input parallel factor
    N_OUT: output parallel factor
    N_ICH: number of input channels
    N_OCH: number of output channels
    BIT_IN: bitwidth of input
    BIT_WT: bitwidth of weights
    BIT_OUT: bitwidth of output

#pragma HLS DATAFLOW
for fo = 0; fo < N_OCH / N_OUT; ++fo do
    for fi = 0; fi < N_ICH / N_IN; ++fi do
        #pragma HLS PIPELINE II=1
        for i = 0; i < N_IN; ++i do
            #pragma HLS UNROLL
            for o = 0; o < N_OUT; ++o do
                out += in * weight[fo][fi]
            end for
        end for
    end for
end for

Output:
    out: feature map output
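As a functional reference for the loop nest in Algorithm 1, the following plain C++ sketch computes one output pixel of a pointwise (1×1) convolution. The channel counts and tile sizes are illustrative assumptions, and the explicit indexing of `in`, `weight`, and `out` is one plausible concretization of the pseudocode's accumulation; the HLS pragmas are shown as comments only.

```cpp
#include <array>
#include <cstdint>

// Illustrative sizes (assumed for the sketch, not the accelerator's config).
constexpr int N_ICH = 8;   // input channels
constexpr int N_OCH = 4;   // output channels
constexpr int N_IN  = 2;   // input parallel factor
constexpr int N_OUT = 2;   // output parallel factor

// Pointwise convolution at one pixel position: every output channel is a
// dot product over all input channels. The two outer loops walk channel
// tiles; the two inner loops (UNROLL in HLS) become N_IN x N_OUT parallel
// multiply-accumulates per cycle.
void pointwise_conv(const std::array<int32_t, N_ICH>& in,
                    const int32_t weight[N_OCH][N_ICH],
                    std::array<int32_t, N_OCH>& out) {
    out.fill(0);
    for (int fo = 0; fo < N_OCH / N_OUT; ++fo)
        for (int fi = 0; fi < N_ICH / N_IN; ++fi)   // PIPELINE II=1 in HLS
            for (int i = 0; i < N_IN; ++i)          // unrolled
                for (int o = 0; o < N_OUT; ++o)     // unrolled
                    out[fo * N_OUT + o] +=
                        in[fi * N_IN + i] * weight[fo * N_OUT + o][fi * N_IN + i];
}
```

With all-ones input and weights, each output channel accumulates N_ICH products, which makes the tiling easy to sanity-check against a naive two-loop implementation.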
Algorithm 2. Pseudocode for the depthwise convolution layer

Input:
    in: feature map input
    weight[N_CH / N_IO][9]: weights of the neural network
    N_IO: input/output parallel factor
    N_CH: number of input channels
    BIT_IN: bitwidth of input
    BIT_WT: bitwidth of weights
    BIT_OUT: bitwidth of output

#pragma HLS DATAFLOW
for f = 0; f < N_CH / N_IO; ++f do
    for k = 0; k < 9; ++k do
        #pragma HLS PIPELINE II=1
        wt_buf = weight[f][k]
        for i = 0; i < N_IO; ++i do
            #pragma HLS UNROLL
            for o = 0; o < N_IO; ++o do
                out += in * wt_buf
            end for
        end for
    end for
end for

Output:
    out: feature map output
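A matching plain C++ sketch of Algorithm 2's depthwise 3×3 convolution, again for one output pixel: unlike the pointwise case, each channel is convolved only with its own 3×3 kernel, so there is no accumulation across channels. The channel count and parallel factor are illustrative assumptions, and the 3×3 input window is passed pre-flattened for simplicity.

```cpp
#include <array>
#include <cstdint>

// Illustrative sizes (assumed for the sketch).
constexpr int N_CH = 4;  // channels
constexpr int N_IO = 2;  // channel parallel factor

// Depthwise 3x3 convolution at one pixel. in[c] holds the 3x3 input window
// of channel c flattened to 9 values. The k loop (PIPELINE II=1 in HLS)
// walks the 9 kernel positions; the inner loop (UNROLL) processes N_IO
// channels per cycle, each with its own kernel.
void depthwise_conv(const int32_t in[N_CH][9],
                    const int32_t weight[N_CH][9],
                    std::array<int32_t, N_CH>& out) {
    out.fill(0);
    for (int f = 0; f < N_CH / N_IO; ++f)
        for (int k = 0; k < 9; ++k)           // PIPELINE II=1 in HLS
            for (int i = 0; i < N_IO; ++i) {  // unrolled
                int c = f * N_IO + i;         // channel-private kernel
                out[c] += in[c][k] * weight[c][k];
            }
}
```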
| Net name | ResNet-18 | ResNet-34 | ResNet-50 | VGG-16 | SkyNet |
|---|---|---|---|---|---|
| Parameters | 11.18 M | 21.28 M | 23.51 M | 14.71 M | 0.44 M |
| IoU | 0.61 | 0.26 | 0.32 | 0.25 | 0.73 |
Table 1. SkyNet parameters and performance compared with classical networks on the DAC-SDC dataset
| Layer | Type | K | C | FM | #MAC | PF |
|---|---|---|---|---|---|---|
| 1 | DW | 3 | 3 | 160×320 | 1382400 | 3 |
| 2 | PW | 1 | 48 | 160×320 | 7372800 | 12 |
| 3 | DW | 3 | 48 | 80×160 | 5529600 | 12 |
| 4 | PW | 1 | 96 | 80×160 | 58982400 | 96 |
| 5 | DW | 3 | 96 | 40×80 | 2764800 | 6 |
| 6 | PW | 1 | 192 | 40×80 | 58982400 | 96 |
| 7 | DW | 3 | 192 | 20×40 | 2764800 | 3 |
| 8 | PW | 1 | 384 | 20×40 | 58982400 | 96 |
| 9 | DW | 3 | 384 | 20×40 | 2764800 | 6 |
| 10 | PW | 1 | 512 | 20×40 | 157286400 | 256 |
| 11 | DW | 3 | 1280 | 20×40 | 9216000 | 16 |
| 12 | PW | 1 | 96 | 20×40 | 98304000 | 160 |
| 13 | PW | 1 | 10 | 20×40 | 768000 | 2 |
| Total | | | | | 465100800 | 764 |
Table 2. SkyNet's parallelism factors for each layer
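The #MAC column of Table 2 follows the standard per-layer multiply-accumulate counts: 9·C·H·W for a 3×3 depthwise layer and C_in·C_out·H·W for a 1×1 pointwise layer. The small helpers below are an assumed re-derivation for checking a few early rows against the table (for the pointwise rows, C_in is taken from the preceding layer's channel count):

```cpp
#include <cstdint>

// MACs of a 3x3 depthwise layer: 9 multiplies per channel per output pixel.
int64_t mac_dw(int64_t ch, int64_t h, int64_t w) {
    return 9 * ch * h * w;
}

// MACs of a 1x1 pointwise layer: one multiply per (c_in, c_out) pair
// per output pixel.
int64_t mac_pw(int64_t c_in, int64_t c_out, int64_t h, int64_t w) {
    return c_in * c_out * h * w;
}
```

For example, layer 1 (DW, 3 channels, 160×320) gives 9·3·160·320 = 1382400 MACs and layer 4 (PW, 48 → 96 channels, 80×160) gives 48·96·80·160 = 58982400 MACs, matching the table.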
| | iSmart | BJUT Runner | SkrSkr | Our work |
|---|---|---|---|---|
| Model | SkyNet | UltraNet | SkyNet | SkyNet |
| # of MACs | 465 M | 272 M | 465 M | 465 M |
| # of PFs | 256 | 448 | 512 | 764 |
| Frequency (MHz) | 220 | 166 | 333 | 350 |
| BRAMs | 209 | 150.5 | 209 | 206.5 |
| DSPs | 329 | 360 | 329 | 360 |
| LUTs | 53809 | 44633 | 52875 | 50518 |
| FFs | 55833 | 58813 | 55278 | 40488 |
| Precision (W, A) | 11, 9 | 4, 4 | 6, 8 | 5, 8 |
| IoU | 73.1% | 65.6% | 73.1% | 72.3% |
| Throughput (FPS) | 25 | 213 | 52 | 551 |
| Power (W) | 13.5 | 6.66 | 6.7 | 8.4 |
| Energy (mJ/img) | 540 | 31 | 128 | 15.2 |
Table 3. Comparison with other DAC-SDC accelerator designs