• Journal of Semiconductors
  • Vol. 41, Issue 2, 021402 (2020)
Zhengjie Li, Yufan Zhang, Jian Wang, and Jinmei Lai
Author Affiliations
  • State Key Lab of ASIC and System, School of Microelectronics, Fudan University, Shanghai 201203, China
    DOI: 10.1088/1674-4926/41/2/021402
    Zhengjie Li, Yufan Zhang, Jian Wang, Jinmei Lai. A survey of FPGA design for AI era[J]. Journal of Semiconductors, 2020, 41(2): 021402
    Fig. 1. (Color online) Simplified architecture of (a) baseline DSP and (b) enhanced DSP.
    Fig. 2. (Color online) Proposed extra carry chain architecture modifications.
    Fig. 3. (Color online) The difference between CNN and BNN: (a) CNN, (b) BNN, and (c) XNOR replacing multiplication in BNN.
    Fig. 4. (Color online) ALM modifications: (a) ALM modification 1 and (b) ALM modification 2.
    Fig. 5. (Color online) Intel Agilex architecture: (a) Agilex architecture and (b) advanced memory hierarchy.
    Fig. 6. (Color online) ACAP Architecture. (a) ACAP architecture. (b) AI engine.
    | No. | Inventor | Module | Goal | Enhancement | Advantage |
    | --- | --- | --- | --- | --- | --- |
    | 1 | A Boutros et al.[14] | DSP | Low-precision computation | DSP block to support 9-bit and 4-bit multiplication | Pack 2× as many 9-bit and 4× as many 4-bit multiplications compared to the baseline Arria-10-like DSP |
    | 2 | Intel[15] | DSP | Low-precision computation | Agilex supports INT8 computation | Provide 2× the number of 9 × 9 multipliers and double the number of INT8 operations compared to the prior generation |
    | 3 | Intel[15] | DSP | High-accuracy computation | Agilex supports FP32, FP16 and BFLOAT16 | Provide up to 40 TFLOPs FP16 or BF16, or up to 20 TFLOPs FP32 DSP performance |
    | 4 | Xilinx[16] | DSP | Low-precision computation | DSP Engine supports INT8 computation | VC1902 of AI Core Series provides INT8 peak performance up to 13.6 TOP/s[25] |
    | 5 | Xilinx[16] | DSP | High-accuracy computation | DSP Engine supports FP32 and FP16 | VC1902 of AI Core Series provides FP32 peak performance up to 3.2 TFLOP/s[25] |
    | 6 | A Boutros et al.[17] | ALM | Low-precision computation | ALM with extra carry chain, or more adders, or shadow multipliers | Extra carry chain provides a 1.5× increase in MAC density; 4-bit adder and 9-bit shadow multiplier provide a 6.1× increase in MAC density |
    | 7 | J H Kim et al.[18] | ALM/CLB | Support BNN | Extra carry chain which propagates sum; additional FA | The first change reduces ALM/LUT usage by 23%–44%; the second change reduces ALM/LUT usage by 39%–60%[18] |
    | 8 | Intel[15] | Memory | Support more memory resources | Embedded memory, in-package HBM, off-chip memory interfaces | On-chip memory includes MLABs (640b), block RAM (M20K), and eSRAM (18 MB); in-package memory includes HBM2E; on-board memory includes DDR4/5, QDR/RLDRAM, Intel Optane DC Persistent Memory |
    | 10 | Xilinx[16] | Memory | Support more memory resources | Embedded memory, off-chip memory interfaces | Distributed RAM (64-bit per CLB), block RAM (36 KB), UltraRAM (288 KB), Accelerator RAM; DDR4/LPDDR4 |
    | 11 | Xilinx[20] | AI Engine | Artificial intelligence | An array of VLIW SIMD high-performance processors[20] | Deliver up to 8× silicon compute density at 50% the power consumption of traditional programmable logic solutions[20] |
    | 12 | Intel[15] | Platform | For data-centric world | 10-nm Agilex; innovative chiplet architecture[28] | Deliver up to 40% higher core performance, or up to 40% lower power over previous generation FPGAs[28] |
    | 13 | Xilinx[16] | Platform | Adaptive compute acceleration platforms | Intelligent engines (AI and DSP), adaptable engines, and scalar engines | Achieve performance improvements of up to 20× over today's fastest FPGA implementations and over 100× over today's fastest CPU implementations[19] |
    Table 1. Summary of all enhancements of FPGA for AI era.
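
    The XNOR substitution named in Fig. 3(c) can be illustrated with a minimal software sketch. It assumes the common BNN encoding in which +1 is stored as bit 1 and -1 as bit 0; the helper names `encode` and `xnor_dot` are hypothetical and do not come from the article, and the sketch says nothing about how the surveyed ALM/CLB modifications implement the operation in hardware.

```python
def encode(values):
    """Pack a list of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors using XNOR and popcount.

    XNOR yields 1 wherever the two signs agree (product +1) and 0 where
    they differ (product -1), so dot = matches - (n - matches).
    """
    mask = (1 << n) - 1                      # keep only the n valid bits
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n


# Illustrative inputs (assumed, not taken from the article).
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))   # ordinary multiply-accumulate
result = xnor_dot(encode(a), encode(b), len(a))
print(reference, result)
```

    Both values agree, which shows why a binarized layer needs only bitwise logic and a population count rather than multipliers.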