• Journal of Semiconductors
  • Vol. 41, Issue 2, 022403 (2020)
Cheng Luo1, Man-Kit Sit2, Hongxiang Fan2, Shuanglong Liu2, Wayne Luk2, and Ce Guo2
Author Affiliations
  • 1State Key Laboratory of ASIC and System, Fudan University, Shanghai 200050, China
  • 2Department of Computing, Imperial College London, London, United Kingdom
    DOI: 10.1088/1674-4926/41/2/022403
    Cheng Luo, Man-Kit Sit, Hongxiang Fan, Shuanglong Liu, Wayne Luk, Ce Guo. Towards efficient deep neural network training by FPGA-based batch-level parallelism[J]. Journal of Semiconductors, 2020, 41(2): 022403

    Abstract

    Training deep neural networks (DNNs) requires a significant amount of time and resources to obtain acceptable results, which severely limits their deployment on resource-limited platforms. This paper proposes DarkFPGA, a novel customizable framework to efficiently accelerate the entire DNN training on a single FPGA platform. First, we explore batch-level parallelism to enable efficient FPGA-based DNN training. Second, we devise a novel hardware architecture optimised with a batch-oriented data pattern and tiling techniques to effectively exploit parallelism. Moreover, an analytical model is developed to determine the optimal design parameters for the DarkFPGA accelerator with respect to a specific network specification and FPGA resource constraints. Our results show that the accelerator performs about 10 times faster than CPU training and consumes about a third of the energy of GPU training when using 8-bit integers to train VGG-like networks on the CIFAR dataset on the Maxeler MAX5 platform.
    $\begin{array}{l} \sigma(k) = 2^{(1-k)}, \quad k \in N_+ , \\ Q(x,k) = {\rm{Clip}}\{ \sigma(k) \times {\rm{round}} \left[\dfrac{x}{\sigma(k)}\right], -1 + \sigma(k), 1 - \sigma(k) \} .\end{array}$

    $ \begin{array}{l}\sigma(k) = 2^{(k-1)}, \quad k \in N_+, \\ Q(x,k,{\rm{shift}}) = {\rm{Clip}}\left\{ (x + {\rm{round\_value}} ) \gg {\rm{shift}},\right.\\ \qquad\qquad\qquad \left. -1 + \sigma(k), 1 - \sigma(k) \right\} , \\ {\rm{round\_value}} = 1 \ll ({\rm{shift}} - 1). \end{array} $
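The two quantization forms above, the real-valued rounding form and its integer shift-based equivalent, can be sketched in Python. This is an illustrative reimplementation under the stated definitions, not the paper's code; in the integer version, `x` is assumed to already be an integer-scaled value.

```python
import numpy as np

def sigma(k):
    # sigma(k) = 2^(1 - k): quantization step for k-bit values in (-1, 1)
    return 2.0 ** (1 - k)

def quantize(x, k):
    # Q(x, k): round x to the nearest multiple of sigma(k), then clip
    # to the representable range [-1 + sigma(k), 1 - sigma(k)]
    s = sigma(k)
    return np.clip(s * np.round(x / s), -1 + s, 1 - s)

def quantize_int(x, k, shift):
    # Integer equivalent: add half the shift divisor for round-to-nearest,
    # arithmetic right shift, then clip to the signed k-bit range
    # [-2^(k-1) + 1, 2^(k-1) - 1]
    sigma_int = 2 ** (k - 1)
    round_value = 1 << (shift - 1)
    q = (x + round_value) >> shift
    return max(-sigma_int + 1, min(sigma_int - 1, q))
```

The integer form avoids any floating-point hardware: rounding becomes an add, and scaling becomes a right shift, which is the property the accelerator exploits.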

    $ \begin{array}{l} {\rm{shift}}(x) = {{\rm{round}}({\rm{log}}_2 x )} . \end{array} $

    $ \begin{array}{l} {\rm{shift}}(x) = {{\rm{ceil}}({\rm{log}}_2 x )}. \end{array} $

    $ \begin{array}{l} {\rm{shift}}(x) = ({\rm{leading1}}(x) + 1) . \end{array} $
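The three shift-estimation variants above can be compared directly in a short sketch (illustrative reimplementation): `leading1` is the 0-based index of the most significant set bit of an integer, so `leading1(x) + 1` coincides with Python's `int.bit_length()` and needs no logarithm hardware.

```python
import math

def shift_round(x):
    # shift(x) = round(log2 x)
    return round(math.log2(x))

def shift_ceil(x):
    # shift(x) = ceil(log2 x)
    return math.ceil(math.log2(x))

def shift_leading1(x):
    # shift(x) = leading1(x) + 1: leading1 is the 0-based index of the most
    # significant set bit, so for x > 0 this equals x.bit_length()
    return int(x).bit_length()
```

The three agree for most inputs but differ on exact powers of two, which is the trade-off between arithmetic accuracy and hardware cost.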

    $ \begin{array}{l} W \sim U(-L, +L), \; L = {\rm{max}}\{\sqrt{6/n_{\rm {in}}}, L_{{\rm{min}}}\}, \; L_{{\rm{min}}} = 1 , \end{array} $

    $ \begin{array}{l} a_{\rm {shift}} = {\rm{log}}_2({\rm{max}}\{{\rm{shift}}(L_{\rm{min}}/L), 0\}). \end{array} $

    $ \begin{array}{l} a_{\rm{q}} = Q(a,k_{\rm{A}},a_{\rm {shift}}). \end{array} $

    $ \begin{array}{l} e_{\rm{q}} = Q(e,k_{\rm{E}},{\rm{shift}}({\rm{max}}|e|)), \end{array} $

    $ \begin{array}{l} e_{\rm{q}} = Q(e,k_{\rm{E}},{\rm{shift}}({\rm{or}}|e|)) , \end{array} $

    $ \begin{array}{l} g_{\rm{q}} = {\rm{Bernoulli}}\{ (\eta\times g ) \gg g_{{\rm{shift}}} \}, \\ g_{{\rm{shift}}} = {\rm{shift}}({\rm{or}}|g|), \end{array} $

    $ \begin{array}{l} g_{\rm{q}} = {\rm{Clip}}\left\{ (\eta\times g + {\rm{round\_value}}) \gg g_{{\rm{shift}}},\right.\\ \qquad \left. -\,1 + \sigma(k), 1 - \sigma(k) \right\}, \\ g_{{\rm{shift}}} = {\rm{shift}}({\rm{or}}|g|) ,\\ {\rm{round\_value}} = {\rm{random\_int}}\;{\rm{mod}} (1\ll g_{{\rm{shift}}}), \end{array} $
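The gradient update above uses stochastic rounding: drawing `round_value` uniformly below the shift divisor makes the rounding direction probabilistic, so small gradients survive in expectation. A minimal sketch, assuming `eta` is an integer-scaled learning rate and `g` an integer gradient (names are illustrative):

```python
import random

def quantize_grad(g, eta, k, g_shift):
    # Stochastic rounding: adding a uniform random value in [0, 2^g_shift)
    # before the right shift rounds up or down with probability proportional
    # to the discarded fraction, then clips to the signed k-bit range.
    sigma_int = 2 ** (k - 1)
    round_value = random.getrandbits(g_shift)  # random_int mod (1 << g_shift)
    q = (eta * g + round_value) >> g_shift
    return max(-sigma_int + 1, min(sigma_int - 1, q))
```

In hardware, `random_int` would come from a cheap pseudo-random source such as an LFSR rather than a software RNG.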

    $ \begin{array}{l} {\rm{BRAM}} = \dfrac{4 \times T_{\rm{B}} \times T_{\rm{I}}^2 \times (2\times L_{\rm{I}}) + 4 \times T_{\rm{I}}^2 \times L_{\rm{W}}}{{\rm{BRAM}}_{\rm{SIZE}}}, \end{array} $

    $ \begin{array}{l} {\rm{DSP}} = T_{\rm B} \times T_{\rm I} \times D_{\rm{mul}} + T_{\rm B} \times A_{\rm l} \times D_{\rm{add}} + T_{\rm B} \times D_{\rm{add}}, \end{array} $

    $ \begin{array}{l} {\rm{LUT}} = T_{\rm B} \times T_{\rm I} \times \beta + T_{\rm B} \times \delta , \end{array} $

    $ \begin{array}{l} {\rm{BW}}_{\rm{CONV}} = ( T_{\rm{B}} \times L_{\rm{I}} + \dfrac{ T_{\rm{B}} \times T_{\rm{I}}} { N } \times L_{\rm{O}} + L_{\rm W}) \times f, \\ {\rm{BW}}_{\rm{FC}}= ( T_{\rm{B}} \times L_{\rm{I}} + \dfrac{ T_{\rm B} \times T_{\rm{I}}} { N }\times L_{\rm{O}} + T_{\rm{I}} \times L_{\rm{W}}) \times f, \end{array} $

    $ \begin{array}{l} {\rm{BW}}_{\rm{auxiliary}} = (2\times T_{\rm B} \times L_{\rm I} + 2\times T_{\rm B} \times L_{\rm O}) \times f. \end{array} $

    $ \begin{array}{l} {{T}}_{\rm{CONV}}= \dfrac{B \times C \times K \times F \times H \times W}{T_{\rm I} \times T_{\rm B} \times f}, \\ {{T}}_{\rm{FC}}= \dfrac{B \times C \times F }{T_{\rm I} \times T_{\rm B} \times f}. \end{array} $

    $ \begin{array}{l} T_{\rm{CONV}} = \dfrac{\lceil B \rceil ^{T_{\rm B}} \times \lceil C \times K \rceil ^{T_{\rm I}} \times \lceil F \rceil ^{T_{\rm I}} \times \lceil H \times W \rceil ^{T_{\rm I}} }{T_{\rm B} \times T_{\rm I} \times f}, \\ T_{\rm{FC}} = \dfrac{\lceil B \rceil ^{T_{\rm B}} \times \lceil C \rceil ^{T_{\rm I}} \times \lceil F \rceil ^{T_{\rm I}} }{T_{\rm B}\times T_{\rm I}\times f} , \\ \lceil X \rceil ^{T} = {\rm{ceil}}(X / T) \times T, \end{array} $
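The padded-ceiling latency model above, with ⌈X⌉^T = ceil(X/T) × T, can be evaluated with a short helper. This sketch is illustrative and counts cycles only, with the clock frequency f factored out:

```python
import math

def ceil_to(x, t):
    # ceil(X / T) * T: pad dimension X up to the next multiple of tile size T
    return math.ceil(x / t) * t

def conv_cycles(B, C, K, F, H, W, T_B, T_I):
    # Tiled convolution cycle count: every loop dimension is padded to a
    # multiple of its tile size, then divided by the batch/image parallelism
    return (ceil_to(B, T_B) * ceil_to(C * K, T_I) * ceil_to(F, T_I)
            * ceil_to(H * W, T_I)) // (T_B * T_I)

def fc_cycles(B, C, F, T_B, T_I):
    # Fully-connected layer cycle count under the same tiling scheme
    return (ceil_to(B, T_B) * ceil_to(C, T_I) * ceil_to(F, T_I)) // (T_B * T_I)
```

The padding is why the refined model is less optimistic than the ideal one above it: dimensions that are not multiples of the tile sizes waste part of each tile.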

    $ \begin{array}{l} T_{\rm{auxiliary}} = \dfrac{\lceil B \rceil ^{T_{\rm B}} \times C \times H \times W }{T_{\rm B} \times f}, \end{array} $

    $\begin{array}{l} {\rm{Minimize\;Time}} = T'_{\rm{CONV}} + T'_{\rm{FC}} + T'_{\rm{auxiliary}} ,\\ {\rm{where}}\left\{ \begin{array}{l} {\rm{LUT}} + {\rm{LUT}}_{\rm{fix}} \leqslant {\rm{LUT}}_{\rm{limit}} , \\ {\rm{BRAM}} + {\rm{BRAM}}_{\rm fix} \leqslant {\rm{BRAM}}_{\rm{limit}}, \\ {\rm{DSP}} + {\rm{DSP}}_{\rm{fix}} \leqslant {\rm{DSP}}_{\rm{limit}} , \\ {\rm{BW}} \leqslant {\rm{BW}}_{\rm{limit}}, \\ \end{array} \right. \end{array} $
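The constrained minimisation above is a small design-space search over the tile sizes. A brute-force sketch follows; the resource estimators, limits, and key names here are illustrative placeholders, not the paper's calibrated model:

```python
def explore(candidates, limits, estimate):
    # candidates: iterable of (T_B, T_I) tile-size pairs
    # limits: dict of resource budgets keyed by 'LUT', 'BRAM', 'DSP', 'BW'
    # estimate: maps (T_B, T_I) -> dict with the same keys plus 'Time'
    best = None
    for tb, ti in candidates:
        est = estimate(tb, ti)
        # Keep only configurations whose estimated usage fits the device
        if all(est[r] <= limits[r] for r in ('LUT', 'BRAM', 'DSP', 'BW')):
            if best is None or est['Time'] < best[1]['Time']:
                best = ((tb, ti), est)
    return best  # fastest feasible configuration, or None if none fits
```

Because tile sizes are small integers (typically powers of two), exhaustive enumeration is cheap, which is presumably why an analytical model suffices instead of on-device profiling.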
