• Journal of Semiconductors
  • Vol. 41, Issue 2, 022403 (2020)
Cheng Luo1, Man-Kit Sit2, Hongxiang Fan2, Shuanglong Liu2, Wayne Luk2, and Ce Guo2
Author Affiliations
  • 1State Key Laboratory of ASIC and System, Fudan University, Shanghai 200050, China
  • 2Department of Computing, Imperial College London, London, United Kingdom
    DOI: 10.1088/1674-4926/41/2/022403
    Cheng Luo, Man-Kit Sit, Hongxiang Fan, Shuanglong Liu, Wayne Luk, Ce Guo. Towards efficient deep neural network training by FPGA-based batch-level parallelism[J]. Journal of Semiconductors, 2020, 41(2): 022403
    Fig. 1. An overview of inference and training processes on the convolutional layer.
    Fig. 2. (Color online) Comparison of BCHW and CHWB patterns.
    Fig. 3. (Color online) The tiling flow for convolution.
    Fig. 4. (Color online) System overview.
    Fig. 5. (Color online) Hardware architecture of GEMM kernel.
    Fig. 6. (Color online) Input double buffer supporting matrix transposition.
    Fig. 7. (Color online) The DarkFPGA framework.
    Fig. 8. (Color online) Performance and resource consumption under different design-space configurations using int8 weights. (a) Computational time. (b) Performance evaluation. (c) Resource consumption.
    Fig. 9. (Color online) Performance comparison between the homogeneous system and the heterogeneous system.
Parameter | Description
B | The batch size of training examples
C | The number of channels
F | The number of filters
K | The kernel size of weights
H | The height of frames
W | The width of frames
Table 1. Parameters for FPGA training.
Layer | B | C | F | H × W | K
CONV1 | 128 | 3 | 128 | 32 × 32 | 3 × 3
CONV2 | 128 | 128 | 128 | 32 × 32 | 3 × 3
MAXPOOLING | 128 | 128 | 128 | 16 × 16 | 2 × 2
CONV3 | 128 | 128 | 256 | 16 × 16 | 3 × 3
CONV4 | 128 | 256 | 256 | 16 × 16 | 3 × 3
MAXPOOLING | 128 | 256 | 256 | 8 × 8 | 2 × 2
CONV5 | 128 | 256 | 512 | 8 × 8 | 3 × 3
CONV6 | 128 | 512 | 512 | 8 × 8 | 3 × 3
MAXPOOLING | 128 | 512 | 512 | 4 × 4 | 2 × 2
FC | 128 | 8096 | 1024 | − | −
FC | 128 | 1024 | 10 | − | −
SSE | 128 | 10 | 10 | − | −
Table 2. The network architecture used in the experiments.
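For reference, the VGG-like topology of Table 2 can be written down in a few lines of PyTorch. This is only a sketch: the padding of 1 for the 3 × 3 convolutions, the ReLU activations, and the flattened FC input of 512 × 4 × 4 = 8192 (listed as 8096 in the table) are assumptions made for illustration, not details taken from the paper.

```python
# Minimal PyTorch sketch of the VGG-like CIFAR10 network in Table 2.
import torch
import torch.nn as nn


class VggLikeCifar10(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),    # CONV1: 32 x 32
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),  # CONV2: 32 x 32
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                # 32 x 32 -> 16 x 16
            nn.Conv2d(128, 256, kernel_size=3, padding=1),  # CONV3: 16 x 16
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),  # CONV4: 16 x 16
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                # 16 x 16 -> 8 x 8
            nn.Conv2d(256, 512, kernel_size=3, padding=1),  # CONV5: 8 x 8
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),  # CONV6: 8 x 8
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                # 8 x 8 -> 4 x 4
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 4 * 4, 1024),  # FC (8192 inputs; table lists 8096)
            nn.ReLU(inplace=True),
            nn.Linear(1024, 10),           # FC to 10 CIFAR10 classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


if __name__ == "__main__":
    model = VggLikeCifar10()
    batch = torch.randn(128, 3, 32, 32)   # B = 128 as in Table 2
    print(model(batch).shape)             # torch.Size([128, 10])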
Parameter | CPU | GPU | DarkFPGA
Platform | Intel Xeon X5690 | GTX 1080 Ti | MAX5 Platform
No. of cores | 6 | 3584 | −
Compiler | GCC 5.4.0 | CUDA 9.0 | Maxcompiler 2019.2
Flag | -Ofast | − | −
Frequency (GHz) | 3.47 | 1.58 | 0.2
Precision | 32-bit floating point | 32-bit floating point | 8-bit fixed point
Technology (nm) | 32 | 28 | 16
Processing time per batch (ms) | 66439 (3270) | 126 (53.4) | 331
Threads | 1 (24) | − | −
Power (W) | 131 (204) | 187 (217) | 13.5
Energy (J) | 8712 (667.1) | 23.6 (11.6) | 4.5
Energy efficiency | 1x (13x) | 369x (751x) | 1936x
Table 3. Performance comparison among FPGA, CPU and GPU.
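As a quick consistency check on Table 3, each energy entry is simply the measured power multiplied by the per-batch processing time, and the energy-efficiency row normalises against the single-threaded CPU baseline; the rounding below is mine:

```latex
E = P \cdot t, \qquad
E_{\text{CPU}} \approx 131\,\text{W} \times 66.44\,\text{s} \approx 8.7\,\text{kJ}, \qquad
E_{\text{DarkFPGA}} \approx 13.5\,\text{W} \times 0.331\,\text{s} \approx 4.5\,\text{J},

\frac{E_{\text{CPU}}}{E_{\text{DarkFPGA}}} \approx \frac{8712\,\text{J}}{4.5\,\text{J}} \approx 1936\times .
```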
Accelerator | Platform | Config | Model, dataset | LUTs (k) | DSPs | Performance (GOPS) | Throughput (image/s)
F-CNN[29] | FCCM 16, Altera Stratix V | 8 FPGAs | LeNet-5, MNIST | − | − | 7 | −
FPDeep[31] | FCCM 18, Virtex7 VC709 | 10 FPGAs | AlexNet, ImageNet | ≈ 460 per FPGA | ≈ 2880 per FPGA | ≈ 1022 per FPGA | −
DiCecco et al.[33] | FPGA 18, Xilinx XCKU115 | 1 FPGA | LeNet-like, CIFAR10 | ≈ 530 | ≈ 883 | − | 522
Nakahara et al.[34] | FPL 19, UltraScale+ XCVU9P | 1 FPGA | VGG16, CIFAR10 | 934 | 1106 | − | 4878
Sean et al.[35] | FPT 19, Zynq ZCU111 | 1 FPGA | VGG-16, CIFAR10 | 73.1 | 1037 | 3.3 | −
DarkFPGA | UltraScale+ XCVU9P | 1 FPGA | VGG-like, CIFAR10 | 480 | 4202 | 1417 | 386.7
Note: (1) '−' means the metric is not provided in the corresponding paper; '≈' indicates that the value is obtained by approximate estimation. (2) The accelerator from Ref. [29] does not compute the gradients for training. (3) The power consumption of Ref. [29] is measured from the entire development board, whereas our power consumption is measured from a single FPGA chip.
    Table 4. Performance comparison of different FPGA-based training accelerators.
    Algorithm 1: Pseudocode for training convolutional layers
Forward propagation:
  for b = 1 to B do
    for c = 1 to C × K do
      for f = 1 to F do
        for im = 1 to H * W do
          A_{l+1}[b][f][im] += W_l[f][c] * A_l[b][c][im]
Backward propagation:
  for b = 1 to B do
    for c = 1 to C × K do
      for f = 1 to F do
        for im = 1 to H * W do
          E_l[b][c][im] += W_l[f][c] * E_{l+1}[b][f][im]
Gradient generation:
  for b = 1 to B do
    for c = 1 to C × K do
      for f = 1 to F do
        for im = 1 to H * W do
          G_l[b][f][c] += A_l[b][c][im] * E_{l+1}[b][f][im]
Table 5. Algorithm 1: pseudocode for training convolutional layers.
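To make the batch-level structure of Algorithm 1 concrete, the sketch below expresses the three passes as plain batched matrix multiplications in numpy, assuming the convolution has already been lowered to GEMM form (in the spirit of the tiling flow of Fig. 3). It is only an illustrative reference under those assumptions, not the FPGA kernel itself.

```python
# numpy sketch of Algorithm 1: forward, backward and gradient passes of a
# convolutional layer expressed as batch-level GEMMs over lowered activations.
# Shapes follow Table 1: B batch, C*K unrolled channels/kernel, F filters,
# H*W spatial positions.
import numpy as np

B, CK, F, HW = 128, 3 * 9, 128, 32 * 32   # e.g. CONV1 of Table 2 with 3x3 kernels

A_l    = np.random.randn(B, CK, HW).astype(np.float32)  # lowered input activations
W_l    = np.random.randn(F, CK).astype(np.float32)      # layer weights
E_next = np.random.randn(B, F, HW).astype(np.float32)   # errors from layer l+1

# Forward propagation: A_{l+1}[b] = W_l @ A_l[b]           -> (B, F, HW)
A_next = np.einsum('fc,bch->bfh', W_l, A_l)

# Backward propagation: E_l[b] = W_l^T @ E_{l+1}[b]        -> (B, CK, HW)
E_l = np.einsum('fc,bfh->bch', W_l, E_next)

# Gradient generation: G_l[b] = E_{l+1}[b] @ A_l[b]^T      -> (B, F, CK)
G_l = np.einsum('bch,bfh->bfc', A_l, E_next)

# The per-example gradients are finally summed over the batch dimension.
weight_grad = G_l.sum(axis=0)
print(A_next.shape, E_l.shape, weight_grad.shape)
```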
    Algorithm 2: Pseudocode of tiled matrix multiplication
The weight matrix and gradient matrix are transferred as TI × TI tiled blocks, while the input frames, output frames and error frames are transferred as 3-dimensional TB × TI × TI tiled blocks. In particular, the input and output frames of the fully-connected layer are transferred as 2-dimensional TB × TI tiled blocks.
Convolutional forward propagation:
  for f = 1 to F/TI do
    for im = 1 to (H * W)/TI do
      for b = 1 to B/TB do
        for c = 1 to (C × K)/TI do
          A_{l+1}(b)(f, im)_c = W_l(f, c) × A_l(b)(c, im)
          A_{l+1}(b)(f, im) += A_{l+1}(b)(f, im)_c
        Quantize(A_{l+1}(b)(f, im))
        Output A_{l+1}(b)(f, im)
Convolutional backward propagation:
  for c = 1 to (C × K)/TI do
    for im = 1 to (H * W)/TI do
      for b = 1 to B/TB do
        for f = 1 to F/TI do
          E_l(b)(c, im)_f = W_l(f, c) × E_{l+1}(b)(f, im)
          E_l(b)(c, im) += E_l(b)(c, im)_f
        Quantize(E_l(b)(c, im))
        Output E_l(b)(c, im)
Convolutional gradient generation:
  for f = 1 to F/TI do
    for c = 1 to (C × K)/TI do
      for b = 1 to B/TB do
        for im = 1 to (H * W)/TI do
          G_l(b)(f, c)_im = A_l(b)(c, im) × E_{l+1}(b)(f, im)
          G_l(f, c) += G_l(b)(f, c)_im
      Quantize(G_l(f, c))
      Output G_l(f, c)
Fully-connected forward propagation:
  for f = 1 to F/TI do
    for b = 1 to B/TB do
      for c = 1 to C/TI do
        A_{l+1}(b)(f)_c = W_l(f, c) × A_l(b)(c)
        A_{l+1}(b)(f) += A_{l+1}(b)(f)_c
      Quantize(A_{l+1}(b)(f))
      Output A_{l+1}(b)(f)
Table 6. Algorithm 2: pseudocode of tiled matrix multiplication.
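The following numpy sketch mirrors the tiled loop structure of Algorithm 2 for the convolutional forward pass. The tile sizes TB and TI and the symmetric int8 quantization step with a fixed scale are illustrative assumptions used only to keep the example short; the hardware GEMM kernel of Fig. 5 is what actually executes these tiles.

```python
# Simplified software model of the tiled GEMM in Algorithm 2 (forward pass).
import numpy as np


def quantize(x, scale=0.05):
    """Hypothetical symmetric int8 quantization with a fixed scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def tiled_conv_forward(W, A, TB=16, TI=32, scale=0.05):
    """W: (F, C*K) weight matrix, A: (B, C*K, H*W) lowered activations.
    Returns int8 activations of layer l+1 with shape (B, F, H*W)."""
    F, CK = W.shape
    B, _, HW = A.shape
    out = np.zeros((B, F, HW), dtype=np.float32)
    for f in range(0, F, TI):                  # tiles of output filters
        for im in range(0, HW, TI):            # tiles of spatial positions
            for b in range(0, B, TB):          # tiles of the batch
                acc = None
                for c in range(0, CK, TI):     # reduction over input tiles
                    w_tile = W[f:f + TI, c:c + TI]               # (TI, TI)
                    a_tile = A[b:b + TB, c:c + TI, im:im + TI]   # (TB, TI, TI)
                    prod = np.einsum('fc,bch->bfh', w_tile, a_tile)
                    acc = prod if acc is None else acc + prod
                out[b:b + TB, f:f + TI, im:im + TI] = acc
    return quantize(out, scale)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 3 * 9), dtype=np.float32)          # CONV1 weights
    A = rng.standard_normal((128, 3 * 9, 32 * 32), dtype=np.float32)
    print(tiled_conv_forward(W, A).shape)   # (128, 128, 1024)
```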