Towards efficient deep neural network training by FPGA-based batch-level parallelism

  • Journal of Semiconductors
  • Vol. 41, Issue 2, 022403 (2020)

Cheng Luo1, Man-Kit Sit2, Hongxiang Fan2, Shuanglong Liu2, Wayne Luk2 and Ce Guo2

Author Affiliations
  • 1 State Key Laboratory of ASIC and System, Fudan University, Shanghai 200050, China
  • 2 Department of Computing, Imperial College London, London, United Kingdom

    DOI: 10.1088/1674-4926/41/2/022403
    Citation: Cheng Luo, Man-Kit Sit, Hongxiang Fan, Shuanglong Liu, Wayne Luk, Ce Guo. Towards efficient deep neural network training by FPGA-based batch-level parallelism[J]. Journal of Semiconductors, 2020, 41(2): 022403

    Abstract

    Training deep neural networks (DNNs) requires significant time and resources to obtain acceptable results, which severely limits its deployment on resource-limited platforms. This paper proposes DarkFPGA, a novel customizable framework to efficiently accelerate the entire DNN training on a single FPGA platform. First, we explore batch-level parallelism to enable efficient FPGA-based DNN training. Second, we devise a novel hardware architecture optimised by a batch-oriented data pattern and tiling techniques to effectively exploit parallelism. Moreover, an analytical model is developed to determine the optimal design parameters for the DarkFPGA accelerator for a given network specification and FPGA resource constraints. Our results show that the accelerator performs about 10 times faster than CPU training and consumes about a third of the energy of GPU training when using 8-bit integers to train VGG-like networks on the CIFAR dataset on the Maxeler MAX5 platform.
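
    For intuition only (this sketch is not part of the article), the following Python fragment mimics the batch-level parallelism described above: a tile of TB samples and TI outputs is updated in lockstep at each step, standing in for the parallel multiply-accumulate array on the FPGA. The tile sizes and function names are illustrative assumptions.

        import numpy as np

        # Illustrative tile sizes (assumed values, not taken from the paper).
        TB = 4    # batch-level parallelism: samples processed in lockstep
        TI = 8    # output-channel parallelism within each sample

        def tiled_forward(x, w):
            """Batch-parallel tiled matrix multiply: y[b, f] = sum_c x[b, c] * w[c, f].
            The innermost update touches a TB x TI block at once, modelling the
            parallel multiply-accumulate array; all other loops are sequential."""
            B, C = x.shape
            _, F = w.shape
            y = np.zeros((B, F), dtype=np.int64)
            for b0 in range(0, B, TB):           # batch tiles
                for f0 in range(0, F, TI):       # output tiles
                    for c in range(C):           # sequential reduction
                        # one "cycle": TB x TI multiply-accumulates in parallel
                        y[b0:b0 + TB, f0:f0 + TI] += np.outer(x[b0:b0 + TB, c],
                                                              w[c, f0:f0 + TI])
            return y

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            x = rng.integers(-8, 8, size=(16, 32))
            w = rng.integers(-8, 8, size=(32, 24))
            assert np.array_equal(tiled_forward(x, w), x @ w)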
    $ \sigma(k)=2^{1-k},\ k\in\mathbb{N}^{+},\quad Q(x,k)=\mathrm{Clip}\left\{\sigma(k)\times\mathrm{round}\left[\frac{x}{\sigma(k)}\right],\,-1+\sigma(k),\,1-\sigma(k)\right\}. $

    $ \sigma(k)=2^{k-1},\ k\in\mathbb{N}^{+},\quad Q(x,k,\mathrm{shift})=\mathrm{Clip}\left\{(x+\mathrm{round\_value})\gg\mathrm{shift},\,-1+\sigma(k),\,1-\sigma(k)\right\},\quad \mathrm{round\_value}=1\ll(\mathrm{shift}-1). $
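
    As a minimal Python sketch (not from the article) of the two quantisation forms above: the floating-point form rounds to multiples of sigma(k) = 2^(1-k), while the integer form adds a rounding bias and right-shifts. We assume the integer variant clips to the symmetric k-bit range; the helper names quantize_float and quantize_int are ours.

        import numpy as np

        def quantize_float(x, k):
            """Floating-point form: round x to the nearest multiple of
            sigma(k) = 2**(1 - k), then clip to [-1 + sigma, 1 - sigma]."""
            sigma = 2.0 ** (1 - k)
            return np.clip(sigma * np.round(x / sigma), -1 + sigma, 1 - sigma)

        def quantize_int(x, k, shift):
            """Integer-only form: add a rounding bias of 2**(shift - 1), then
            arithmetic right-shift; x is assumed to be a wide integer value
            (e.g. a 32-bit accumulator) being requantised to k bits."""
            sigma = 2 ** (k - 1)
            round_value = (1 << (shift - 1)) if shift > 0 else 0
            y = (x + round_value) >> shift
            return np.clip(y, -(sigma - 1), sigma - 1)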

    $ \mathrm{shift}(x)=\mathrm{round}(\log_{2}x). $

    $ \mathrm{shift}(x)=\mathrm{ceil}(\log_{2}x). $

    $ \mathrm{shift}(x)=\mathrm{leading1}(x)+1. $
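
    In hardware the shift amount reduces to locating the leading one bit. The Python sketch below (function names are ours) shows the three variants side by side.

        import math

        def shift_round(x):
            """shift(x) = round(log2 x): exact, but needs a logarithm."""
            return round(math.log2(x))

        def shift_ceil(x):
            """shift(x) = ceil(log2 x): always errs on the large side."""
            return math.ceil(math.log2(x))

        def shift_leading_one(x):
            """Leading-one variant: the index of the most significant set bit
            plus one.  For positive integers this equals ceil(log2 x) except
            at exact powers of two, making it a cheap hardware substitute."""
            return max(int(x).bit_length(), 1)   # guard x == 0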

    $ W\sim U(-L,+L),\quad L=\max\left\{\sqrt{6/n_{\mathrm{in}}},\,L_{\min}\right\},\quad L_{\min}=1, $

    $ a_{\mathrm{shift}}=\log_{2}\left(\max\left\{\mathrm{shift}(L_{\min}/L),\,0\right\}\right). $

    $ a_{q}=Q(a,k_{A},a_{\mathrm{shift}}). $
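
    A hedged sketch of the initialisation and activation scaling above (variable names are ours; quantize_int is the helper from the earlier sketch, and a_shift is assumed to be a precomputed per-layer constant):

        import math
        import numpy as np

        def init_weights(n_in, n_out, L_min=1.0, rng=None):
            """W ~ U(-L, +L) with L = max(sqrt(6 / n_in), L_min)."""
            rng = rng or np.random.default_rng()
            L = max(math.sqrt(6.0 / n_in), L_min)
            return rng.uniform(-L, L, size=(n_in, n_out))

        def quantize_activations(a, k_A, a_shift):
            """a_q = Q(a, k_A, a_shift): requantise wide accumulator values back
            to k_A-bit activations with a per-layer power-of-two scale."""
            return quantize_int(a, k_A, a_shift)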

    $ e_{q}=Q(e,k_{E},\mathrm{shift}(\max|e|)), $

    $ e_{q}=Q(e,k_{E},\mathrm{shift}(\mathrm{or}\,|e|)), $
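
    Replacing max|e| with a bitwise OR over all error magnitudes yields the same leading-one position at much lower hardware cost, which is what the second form above exploits. A Python sketch follows (names are ours; quantize_int is the helper from the earlier sketch).

        import numpy as np

        def shift_via_or(values):
            """Estimate shift(max|v|) without a full max reduction: OR all
            magnitudes together; the OR has exactly the same most significant
            bit as the true maximum, so the leading-one shift is identical."""
            combined = int(np.bitwise_or.reduce(np.abs(values).astype(np.int64).ravel()))
            return max(combined.bit_length(), 1)   # guard the all-zero case

        def quantize_errors(e, k_E):
            """e_q = Q(e, k_E, shift(or|e|)), with e held as integers."""
            return quantize_int(e, k_E, shift_via_or(e))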

    $ g_{q}=\mathrm{Bernoulli}\left\{(\eta\times g)\gg g_{\mathrm{shift}}\right\},\quad g_{\mathrm{shift}}=\mathrm{shift}(\mathrm{or}\,|g|), $

    $ g_{q}=\mathrm{Clip}\left\{(\eta\times g+\mathrm{round\_value})\gg g_{\mathrm{shift}},\,-1+\sigma(k),\,1-\sigma(k)\right\},\quad g_{\mathrm{shift}}=\mathrm{shift}(\mathrm{or}\,|g|),\quad \mathrm{round\_value}=\mathrm{random\_int}\bmod(1\ll g_{\mathrm{shift}}), $
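
    The second form replaces Bernoulli sampling with a hardware-friendly equivalent: adding a uniform random integer below 2^g_shift before the right shift rounds each value up with probability equal to its discarded fraction. A sketch under those assumptions (names are ours; the learning-rate-scaled gradient is assumed to be held as integers, and shift_via_or is the helper from the sketch above):

        import numpy as np

        def quantize_gradients(scaled_g, k_G, rng=None):
            """Stochastic rounding of the (already learning-rate-scaled, integer)
            gradient: g_q = Clip{(g + r) >> g_shift, ...} with r drawn uniformly
            from [0, 2**g_shift)."""
            rng = rng or np.random.default_rng()
            g = np.asarray(scaled_g, dtype=np.int64)
            g_shift = shift_via_or(g)
            sigma = 2 ** (k_G - 1)
            r = rng.integers(0, 1 << g_shift, size=g.shape, dtype=np.int64)
            g_q = (g + r) >> g_shift
            return np.clip(g_q, -(sigma - 1), sigma - 1)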

    $ \mathrm{BRAM}=\dfrac{4\times T_{B}\times T_{I}^{2}\times(2\times L_{I})+4\times T_{I}^{2}\times L_{W}}{\mathrm{BRAM}_{\mathrm{SIZE}}}, $

    $ \mathrm{DSP}=T_{B}\times T_{I}\times D_{\mathrm{mul}}+T_{B}\times A_{l}\times D_{\mathrm{add}}+T_{B}\times D_{\mathrm{add}}, $

    $ \mathrm{LUT}=T_{B}\times T_{I}\times\beta+T_{B}\times\delta, $

    $ \mathrm{BW}_{\mathrm{CONV}}=\left(T_{B}\times L_{I}+\dfrac{T_{B}\times T_{I}}{N}\times L_{O}+L_{W}\right)\times f,\quad \mathrm{BW}_{\mathrm{FC}}=\left(T_{B}\times L_{I}+\dfrac{T_{B}\times T_{I}}{N}\times L_{O}+T_{I}\times L_{W}\right)\times f, $

    $ \mathrm{BW}_{\mathrm{auxiliary}}=(2\times T_{B}\times L_{I}+2\times T_{B}\times L_{O})\times f. $
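
    The resource and bandwidth expressions above can be evaluated directly when sizing the accelerator. The sketch below plugs in made-up per-unit costs (word lengths, DSPs per operator, LUT coefficients, BRAM block size), since the concrete constants depend on the board and the synthesis results.

        def resource_model(TB, TI,
                           L_I=8, L_W=8,                  # data/weight word lengths (assumed)
                           D_mul=1, D_add=1, A_l=4,       # DSP costs and adder count (assumed)
                           beta=150, delta=300,           # LUT coefficients (assumed)
                           bram_bits=36 * 1024):          # bits per BRAM block (assumed)
            """Evaluate the analytical resource model for one (TB, TI) design point."""
            bram = (4 * TB * TI**2 * (2 * L_I) + 4 * TI**2 * L_W) / bram_bits
            dsp = TB * TI * D_mul + TB * A_l * D_add + TB * D_add
            lut = TB * TI * beta + TB * delta
            return bram, dsp, lut

        def bandwidth_model(TB, TI, f, N, L_I=8, L_W=8, L_O=32):
            """Off-chip bandwidth (bits/s) for convolution, fully-connected and
            auxiliary layers; N is the reduction length over which each output
            is accumulated (assumed meaning)."""
            bw_conv = (TB * L_I + TB * TI / N * L_O + L_W) * f
            bw_fc = (TB * L_I + TB * TI / N * L_O + TI * L_W) * f
            bw_aux = (2 * TB * L_I + 2 * TB * L_O) * f
            return bw_conv, bw_fc, bw_aux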

    $ T_{\mathrm{CONV}}=\dfrac{B\times C\times K\times F\times H\times W}{T_{I}\times T_{B}\times f},\quad T_{\mathrm{FC}}=\dfrac{B\times C\times F}{T_{I}\times T_{B}\times f}. $

    $ T_{\mathrm{CONV}}=\dfrac{B_{T_{B}}\times C\times K_{T_{I}}\times F_{T_{I}}\times H\times W_{T_{I}}}{T_{B}\times T_{I}\times f},\quad T_{\mathrm{FC}}=\dfrac{B_{T_{B}}\times C_{T_{I}}\times F_{T_{I}}}{T_{B}\times T_{I}\times f},\quad X_{T}=\mathrm{ceil}(X/T)\times T, $

    $ T_{\mathrm{auxiliary}}=\dfrac{B_{T_{B}}\times C\times H\times W}{T_{B}\times f}, $
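
    Combining the timing expressions with the resource model gives the kind of analytical design-space search mentioned in the abstract. The sketch below is a simplified variant of ours, not the paper's: it pads only the batch and output dimensions to tile multiples and exhaustively scores candidate (TB, TI) pairs; resource_model is the helper from the sketch above, and the layer shapes and budgets are illustrative.

        def _pad(X, T):
            """X_T = ceil(X / T) * T."""
            return -(-X // T) * T

        def conv_cycles(B, C, K, F, H, W, TB, TI):
            return _pad(B, TB) * C * K * _pad(F, TI) * H * W // (TB * TI)

        def fc_cycles(B, C, F, TB, TI):
            return _pad(B, TB) * C * _pad(F, TI) // (TB * TI)

        def explore(layers, f_hz, dsp_budget, lut_budget, bram_budget):
            """Pick the (TB, TI) pair minimising time per training batch while
            fitting the FPGA budgets."""
            best = None
            for TB in (2, 4, 8, 16, 32):
                for TI in (8, 16, 32, 64):
                    bram, dsp, lut = resource_model(TB, TI)
                    if bram > bram_budget or dsp > dsp_budget or lut > lut_budget:
                        continue
                    cycles = sum(conv_cycles(*shape, TB, TI) if kind == "conv"
                                 else fc_cycles(*shape, TB, TI)
                                 for kind, shape in layers)
                    seconds = cycles / f_hz
                    if best is None or seconds < best[0]:
                        best = (seconds, TB, TI)
            return best

    For a VGG-like network, layers would be a list such as [("conv", (B, C, K, F, H, W)), ..., ("fc", (B, C, F))].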

