• Journal of Semiconductors
  • Vol. 41, Issue 2, 022403 (2020)
Cheng Luo1, Man-Kit Sit2, Hongxiang Fan2, Shuanglong Liu2, Wayne Luk2, and Ce Guo2
Author Affiliations
  • 1State Key Laboratory of ASIC and System, Fudan University, Shanghai 200050, China
  • 2Department of Computing, Imperial College London, London, United Kingdom
    DOI: 10.1088/1674-4926/41/2/022403
    Cheng Luo, Man-Kit Sit, Hongxiang Fan, Shuanglong Liu, Wayne Luk, Ce Guo. Towards efficient deep neural network training by FPGA-based batch-level parallelism[J]. Journal of Semiconductors, 2020, 41(2): 022403
    Fig. 1. An overview of inference and training processes on the convolutional layer.
    Fig. 2. (Color online) Comparison of BCHW and CHWB patterns.
    Fig. 3. (Color online) The tiling flow for convolution.
    Fig. 4. (Color online) System overview.
    Fig. 5. (Color online) Hardware architecture of GEMM kernel.
    Fig. 6. (Color online) Input double buffer supporting matrix transposition.
    Fig. 7. (Color online) The DarkFPGA framework.
    Fig. 8. (Color online) Performance and resource consumption under different design-space configurations using int8 weights. (a) Computational time. (b) Performance evaluation. (c) Resource consumption.
    Fig. 9. (Color online) Performance comparison between the homogeneous system and the heterogeneous system.
Parameter | Description
B | The batch size of training examples
C | The number of channels
F | The number of filters
K | The kernel size of weights
H | The height of frames
W | The width of frames
Table 1. Parameters for FPGA training.
Layer | B | C | F | H × W | K
CONV1 | 128 | 3 | 128 | 32 × 32 | 3 × 3
CONV2 | 128 | 128 | 128 | 32 × 32 | 3 × 3
MAXPOOLING | 128 | 128 | 128 | 16 × 16 | 2 × 2
CONV3 | 128 | 128 | 256 | 16 × 16 | 3 × 3
CONV4 | 128 | 256 | 256 | 16 × 16 | 3 × 3
MAXPOOLING | 128 | 256 | 256 | 8 × 8 | 2 × 2
CONV5 | 128 | 256 | 512 | 8 × 8 | 3 × 3
CONV6 | 128 | 512 | 512 | 8 × 8 | 3 × 3
MAXPOOLING | 128 | 512 | 512 | 4 × 4 | 2 × 2
FC | 128 | 8096 | 1024 | − | −
FC | 128 | 1024 | 10 | − | −
SSE | 128 | 10 | 10 | − | −
Table 2. The network architecture used in the experiments.
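For reference, the VGG-like topology of Table 2 can be written down in a few lines of PyTorch. This is only a sketch: the padding of 1 for the 3 × 3 convolutions, the ReLU activations, and the flattened FC input of 512 × 4 × 4 = 8192 (listed as 8096 in the table) are assumptions made for illustration, not details taken from the paper.

```python
# Minimal PyTorch sketch of the VGG-like CIFAR10 network in Table 2.
import torch
import torch.nn as nn


class VggLikeCifar10(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),    # CONV1: 32 x 32
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),  # CONV2: 32 x 32
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                # 32 x 32 -> 16 x 16
            nn.Conv2d(128, 256, kernel_size=3, padding=1),  # CONV3: 16 x 16
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),  # CONV4: 16 x 16
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                # 16 x 16 -> 8 x 8
            nn.Conv2d(256, 512, kernel_size=3, padding=1),  # CONV5: 8 x 8
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),  # CONV6: 8 x 8
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                # 8 x 8 -> 4 x 4
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 4 * 4, 1024),  # FC (8192 inputs; table lists 8096)
            nn.ReLU(inplace=True),
            nn.Linear(1024, 10),           # FC to 10 CIFAR10 classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


if __name__ == "__main__":
    model = VggLikeCifar10()
    batch = torch.randn(128, 3, 32, 32)   # B = 128 as in Table 2
    print(model(batch).shape)             # torch.Size([128, 10])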
Parameter | CPU | GPU | DarkFPGA
Platform | Intel Xeon X5690 | GTX 1080 Ti | MAX5 Platform
No. of cores | 6 | 3584 | −
Compiler | GCC 5.4.0 | CUDA 9.0 | Maxcompiler 2019.2
Flag | -Ofast | − | −
Frequency (GHz) | 3.47 | 1.58 | 0.2
Precision | 32-bit floating point | 32-bit floating point | 8-bit fixed point
Technology (nm) | 32 | 28 | 16
Processing time per batch (ms) | 66439 (3270) | 126 (53.4) | 331
Threads | 1 (24) | − | −
Power (W) | 131 (204) | 187 (217) | 13.5
Energy (J) | 8712 (667.1) | 23.6 (11.6) | 4.5
Energy efficiency | 1x (13x) | 369x (751x) | 1936x
Table 3. Performance comparison among FPGA, CPU and GPU.
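As a quick consistency check on Table 3, each energy entry is simply the measured power multiplied by the per-batch processing time, and the energy-efficiency row normalises against the single-threaded CPU baseline; the rounding below is mine:

```latex
E = P \cdot t, \qquad
E_{\text{CPU}} \approx 131\,\text{W} \times 66.44\,\text{s} \approx 8.7\,\text{kJ}, \qquad
E_{\text{DarkFPGA}} \approx 13.5\,\text{W} \times 0.331\,\text{s} \approx 4.5\,\text{J},

\frac{E_{\text{CPU}}}{E_{\text{DarkFPGA}}} \approx \frac{8712\,\text{J}}{4.5\,\text{J}} \approx 1936\times .
```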
Accelerator | Platform | Config | Model, dataset | LUTs (k) | DSPs | Performance (GOPS) | Throughput (image/s)
F-CNN[29] | FCCM 16, Altera Stratix V | 8 FPGAs | LeNet-5, MNIST | − | − | 7 | −
FPDeep[31] | FCCM 18, Virtex7 VC709 | 10 FPGAs | AlexNet, ImageNet | ≈ 460 per FPGA | ≈ 2880 per FPGA | ≈ 1022 per FPGA | −
DiCecco et al.[33] | FPGA 18, Xilinx XCKU115 | 1 FPGA | LeNet-like, CIFAR10 | ≈ 530 | ≈ 883 | − | 522
Nakahara et al.[34] | FPL 19, UltraScale+ XCVU9P | 1 FPGA | VGG16, CIFAR10 | 934 | 1106 | − | 4878
Sean et al.[35] | FPT 19, Zynq ZCU111 | 1 FPGA | VGG-16, CIFAR10 | 73.1 | 1037 | 3.3 | −
DarkFPGA | UltraScale+ XCVU9P | 1 FPGA | VGG-like, CIFAR10 | 480 | 4202 | 1417 | 386.7
Note: (1) '−' means the metric is not provided in the corresponding paper; '≈' indicates that the value is obtained by approximate estimation. (2) The accelerator from Ref. [29] does not compute the gradients for training. (3) The power consumption of Ref. [29] is measured from the entire development board, whereas our power consumption is measured from a single FPGA chip.
    Table 4. Performance comparison of different FPGA-based training accelerators.
    Algorithm 1: Pseudocode for training convolutional layers
Forward propagation:
  for b = 1 to B do
    for c = 1 to C × K do
      for f = 1 to F do
        for im = 1 to H * W do
          A_{l+1}[b][f][im] += W_l[f][c] * A_l[b][c][im]
Backward propagation:
  for b = 1 to B do
    for c = 1 to C × K do
      for f = 1 to F do
        for im = 1 to H * W do
          E_l[b][c][im] += W_l[f][c] * E_{l+1}[b][f][im]
Gradient generation:
  for b = 1 to B do
    for c = 1 to C × K do
      for f = 1 to F do
        for im = 1 to H * W do
          G_l[b][f][c] += A_l[b][c][im] * E_{l+1}[b][f][im]
Table 5. Algorithm 1: pseudocode for training convolutional layers.
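To make the batch-level structure of Algorithm 1 concrete, the sketch below expresses the three passes as plain batched matrix multiplications in numpy, assuming the convolution has already been lowered to GEMM form (in the spirit of the tiling flow of Fig. 3). It is only an illustrative reference under those assumptions, not the FPGA kernel itself.

```python
# numpy sketch of Algorithm 1: forward, backward and gradient passes of a
# convolutional layer expressed as batch-level GEMMs over lowered activations.
# Shapes follow Table 1: B batch, C*K unrolled channels/kernel, F filters,
# H*W spatial positions.
import numpy as np

B, CK, F, HW = 128, 3 * 9, 128, 32 * 32   # e.g. CONV1 of Table 2 with 3x3 kernels

A_l    = np.random.randn(B, CK, HW).astype(np.float32)  # lowered input activations
W_l    = np.random.randn(F, CK).astype(np.float32)      # layer weights
E_next = np.random.randn(B, F, HW).astype(np.float32)   # errors from layer l+1

# Forward propagation: A_{l+1}[b] = W_l @ A_l[b]           -> (B, F, HW)
A_next = np.einsum('fc,bch->bfh', W_l, A_l)

# Backward propagation: E_l[b] = W_l^T @ E_{l+1}[b]        -> (B, CK, HW)
E_l = np.einsum('fc,bfh->bch', W_l, E_next)

# Gradient generation: G_l[b] = E_{l+1}[b] @ A_l[b]^T      -> (B, F, CK)
G_l = np.einsum('bch,bfh->bfc', A_l, E_next)

# The per-example gradients are finally summed over the batch dimension.
weight_grad = G_l.sum(axis=0)
print(A_next.shape, E_l.shape, weight_grad.shape)
```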
    Algorithm 2: Pseudocode of tiled matrix multiplication
The weight matrix and gradient matrix are transferred as TI × TI tiled blocks, while the input frames, output frames and error frames are transferred as 3-dimensional TB × TI × TI tiled blocks. In particular, the input and output frames of the fully-connected layer are transferred as 2-dimensional TB × TI tiled blocks.
Convolutional forward propagation:
  for f = 1 to F/TI do
    for im = 1 to (H * W)/TI do
      for b = 1 to B/TB do
        for c = 1 to (C × K)/TI do
          A_{l+1}(b)(f, im)_c = W_l(f, c) × A_l(b)(c, im)
          A_{l+1}(b)(f, im) += A_{l+1}(b)(f, im)_c
        Quantize(A_{l+1}(b)(f, im))
        Output A_{l+1}(b)(f, im)
Convolutional backward propagation:
  for c = 1 to (C × K)/TI do
    for im = 1 to (H * W)/TI do
      for b = 1 to B/TB do
        for f = 1 to F/TI do
          E_l(b)(c, im)_f = W_l(f, c) × E_{l+1}(b)(f, im)
          E_l(b)(c, im) += E_l(b)(c, im)_f
        Quantize(E_l(b)(c, im))
        Output E_l(b)(c, im)
Convolutional gradient generation:
  for f = 1 to F/TI do
    for c = 1 to (C × K)/TI do
      for b = 1 to B/TB do
        for im = 1 to (H * W)/TI do
          G_l(b)(f, c)_im = A_l(b)(c, im) × E_{l+1}(b)(f, im)
          G_l(f, c) += G_l(b)(f, c)_im
      Quantize(G_l(f, c))
      Output G_l(f, c)
Fully-connected forward propagation:
  for f = 1 to F/TI do
    for b = 1 to B/TB do
      for c = 1 to C/TI do
        A_{l+1}(b)(f)_c = W_l(f, c) × A_l(b)(c)
        A_{l+1}(b)(f) += A_{l+1}(b)(f)_c
      Quantize(A_{l+1}(b)(f))
      Output A_{l+1}(b)(f)
Table 6. Algorithm 2: pseudocode of tiled matrix multiplication.
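The following numpy sketch mirrors the tiled loop structure of Algorithm 2 for the convolutional forward pass. The tile sizes TB and TI and the symmetric int8 quantization step with a fixed scale are illustrative assumptions used only to keep the example short; the hardware GEMM kernel of Fig. 5 is what actually executes these tiles.

```python
# Simplified software model of the tiled GEMM in Algorithm 2 (forward pass).
import numpy as np


def quantize(x, scale=0.05):
    """Hypothetical symmetric int8 quantization with a fixed scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def tiled_conv_forward(W, A, TB=16, TI=32, scale=0.05):
    """W: (F, C*K) weight matrix, A: (B, C*K, H*W) lowered activations.
    Returns int8 activations of layer l+1 with shape (B, F, H*W)."""
    F, CK = W.shape
    B, _, HW = A.shape
    out = np.zeros((B, F, HW), dtype=np.float32)
    for f in range(0, F, TI):                  # tiles of output filters
        for im in range(0, HW, TI):            # tiles of spatial positions
            for b in range(0, B, TB):          # tiles of the batch
                acc = None
                for c in range(0, CK, TI):     # reduction over input tiles
                    w_tile = W[f:f + TI, c:c + TI]               # (TI, TI)
                    a_tile = A[b:b + TB, c:c + TI, im:im + TI]   # (TB, TI, TI)
                    prod = np.einsum('fc,bch->bfh', w_tile, a_tile)
                    acc = prod if acc is None else acc + prod
                out[b:b + TB, f:f + TI, im:im + TI] = acc
    return quantize(out, scale)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 3 * 9), dtype=np.float32)          # CONV1 weights
    A = rng.standard_normal((128, 3 * 9, 32 * 32), dtype=np.float32)
    print(tiled_conv_forward(W, A).shape)   # (128, 128, 1024)
```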