A survey of FPGA design for AI era

Zhengjie Li; Yufan Zhang; Jian Wang; Jinmei Lai

doi:10.1088/1674-4926/41/2/021402

Journal of Semiconductors
Vol. 41, Issue 2, 021402 (2020)

Author Affiliations

State Key Lab of ASIC and System, School of Microelectronics, Fudan University, Shanghai 201203, China


Zhengjie Li, Yufan Zhang, Jian Wang, Jinmei Lai. A survey of FPGA design for AI era[J]. Journal of Semiconductors, 2020, 41(2): 021402


Fig. 1. (Color online) Simplified architecture of (a) baseline DSP and (b) enhanced DSP.

Fig. 2. (Color online) Proposed extra carry chain architecture modifications.

Fig. 3. (Color online) The difference between CNN and BNN: (a) CNN, (b) BNN, and (c) XNOR replacing multiplication in BNN.

Fig. 4. (Color online) ALM modifications: (a) ALM modification 1 and (b) ALM modification 2.

Fig. 5. (Color online) Intel Agilex architecture. (a) Agilex architecture. (b) Advanced memory hierarchy.

Fig. 6. (Color online) ACAP Architecture. (a) ACAP architecture. (b) AI engine.

No.	Inventor	Module	Goal	Enhancement	Advantage
1	A Boutros et al.^[14]	DSP	Low-precision computation	DSP block supports 9-bit and 4-bit multiplication	Packs 2× as many 9-bit and 4× as many 4-bit multiplications as the baseline Arria-10-like DSP
2	Intel^[15]	DSP	Low-precision computation	Agilex supports INT8 computation	Provides 2× the number of 9 × 9 multipliers and double the number of INT8 operations compared to the prior generation
3	Intel^[15]	DSP	High-accuracy computation	Agilex supports FP32, FP16 and BFLOAT16	Provides up to 40 TFLOPs FP16 or BF16, or up to 20 TFLOPs FP32 DSP performance
4	Xilinx^[16]	DSP	Low-precision computation	DSP Engine supports INT8 computation	VC1902 of AI Core Series provides INT8 peak performance up to 13.6 TOP/s^[25]
5	Xilinx^[16]	DSP	High-accuracy computation	DSP Engine supports FP32 and FP16	VC1902 of AI Core Series provides FP32 peak performance up to 3.2 TFLOP/s^[25]
6	A Boutros et al.^[17]	ALM	Low-precision computation	ALM with extra carry chain, or more adders, or shadow multipliers	Extra carry chain provides a 1.5× increase in MAC density; 4-bit adder and 9-bit shadow multiplier provide a 6.1× increase in MAC density
7	J H Kim et al.^[18]	ALM/CLB	Support BNN	Extra carry chain which propagates sum; additional FA	The first change reduces ALM/LUT usage by 23%–44%; the second change reduces ALM/LUT usage by 39%–60%^[18]
8	Intel^[15]	Memory	Support more memory resources	Embedded memory, in-package HBM, off-chip memory interfaces	On-chip memory includes MLABs (640b), block RAM (M20K), and eSRAM (18 MB); in-package memory includes HBM2E; on-board memory includes DDR4/5, QDR/RLDRAM, Intel Optane DC Persistent Memory
10	Xilinx^[16]	Memory	Support more memory resources	Embedded memory, off-chip memory interfaces	Distributed RAM (64-bit per CLB), block RAM (36 KB), UltraRAM (288 KB), Accelerator RAM; DDR4/LPDDR4
11	Xilinx^[20]	AI Engine	Artificial intelligence	An array of VLIW SIMD high-performance processors^[20]	Delivers up to 8× silicon compute density at 50% of the power consumption of traditional programmable logic solutions^[20]
12	Intel^[15]	Platform	For data-centric world	10-nm Agilex; innovative chiplet architecture^[28]	Delivers up to 40% higher core performance, or up to 40% lower power, than previous-generation FPGAs^[28]
13	Xilinx^[16]	Platform	Adaptive compute acceleration platforms	Intelligent engines (AI and DSP), adaptable engines, and scalar engines	Achieves performance improvements of up to 20× over today's fastest FPGA implementations and over 100× over today's fastest CPU implementations^[19]
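
Fig. 3(c) and row 7 of the table refer to BNNs replacing multiplication with XNOR. The arithmetic behind that substitution can be illustrated with a minimal Python sketch. It assumes the common encoding in which +1 is stored as bit 1 and -1 as bit 0; the function names are hypothetical and are not taken from the paper or from any vendor tool. Under that assumption, the dot product of two n-element binary vectors equals 2 × popcount(XNOR) − n, which is why such networks map onto LUTs and adder chains rather than DSP multipliers.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an integer (bit i is 1 for +1, 0 for -1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return word


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1, +1} vectors given their packed encodings."""
    mask = (1 << n) - 1                  # keep only the n valid bit positions
    agree = ~(a_bits ^ w_bits) & mask    # XNOR: 1 wherever the two elements match
    matches = bin(agree).count("1")      # popcount of the matching positions
    return 2 * matches - n               # each match adds +1, each mismatch adds -1


def reference_dot(a, w):
    """Ordinary multiply-accumulate, used only to check the XNOR version."""
    return sum(x * y for x, y in zip(a, w))


if __name__ == "__main__":
    activations = [1, -1, 1, 1, -1, -1, 1, -1]
    weights = [1, 1, -1, 1, -1, 1, 1, -1]
    result = xnor_popcount_dot(pack_bits(activations), pack_bits(weights), len(weights))
    print(result, reference_dot(activations, weights))  # both values are 2
```

In hardware the popcount is an adder tree, which is consistent with the carry-chain and full-adder enhancements listed in row 7 targeting ALM/LUT usage.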