Ultra Low Latency and High-Performance Deep Learning Processor With FPGA
Check out this ultra low latency and high-performance Deep Learning Processor (DLP) with FPGA.
Image recognition and analysis have always been a staple of Alibaba's research and development (R&D) projects and have played an important role in Alibaba's product innovation. However, these applications typically involve heavy workloads with strict service-quality requirements. Current solutions such as GPUs cannot meet the low-latency and high-performance requirements at the same time.
To provide a good user experience when applying deep learning, Alibaba's Infrastructure Service Group and the algorithm team from Machine Intelligence Technologies have architected an ultra-low-latency, high-performance DLP (Deep Learning Processor) on FPGA.
The DLP FPGA supports sparse convolution and low-precision data computing at the same time, and a customized ISA (Instruction Set Architecture) was defined to meet the requirements for flexibility and user experience. Latency test results with ResNet18 (sparse kernel) show that Alibaba's FPGA has a latency of only 0.174 ms.
In this article, we will briefly discuss how Alibaba and the team from Machine Intelligence Technologies are able to achieve such a feat with the new DLP FPGA.
Alibaba's newly developed DLP has four types of modules, classified by function.
- Computing: Convolution, Batch Normalization, Activation, and other calculations
- Data Path: Data storage, movement, and reshaping
- Parameter: Storage of weights and other parameters, decoding
- Instruction: Instruction unit and global control
The Protocol Engine (PE) in the DLP can support:
- Int4 data type input
- Int32 data type output
- Int16 quantization
This PE also offers over 90% efficiency. Furthermore, the DLP's weight loading supports a CSR (compressed sparse row) decoder and data prefetching.
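To make the CSR idea concrete, here is a minimal software sketch of how sparse weights can be stored in CSR form and applied to low-precision inputs while skipping zero weights. It is only an illustration: the function names, the data layout, and the int4 range check are assumptions for this example, not the DLP's actual hardware format.

```python
# Hypothetical sketch: CSR-encoded sparse weights applied to int4-range inputs.
# All names and the layout are illustrative assumptions, not the DLP's format.

def csr_encode(dense):
    """Encode a dense 2-D list of ints as CSR: (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for col, weight in enumerate(row):
            if weight != 0:
                values.append(weight)
                col_idx.append(col)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, touching only non-zero weights."""
    if any(not -8 <= v <= 7 for v in x):
        raise ValueError("inputs must fit in signed 4 bits (-8..7)")
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0  # stands in for a wide (for example 32-bit) accumulator
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out

weights = [[0, 3, 0, -2],
           [0, 0, 0, 0],
           [5, 0, 0, 1]]
values, col_idx, row_ptr = csr_encode(weights)
result = csr_matvec(values, col_idx, row_ptr, [7, -8, 3, 2])
print(result)  # [-28, 0, 37]
```

Only four of the twelve weights are stored and multiplied here, which is the saving a sparse kernel aims for. An all-zero row costs no multiplications at all.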
Retraining is needed to obtain a high-accuracy model. Four main steps are used to obtain both sparse weights and low-precision feature maps.
We used an effective method to train the ResNet18 model to be sparse and low precision (arXiv:1707.09870). The key component of our method is discretization. We focused on compressing and accelerating deep models whose network weights are represented by a very small number of bits, referred to as extremely low-bit neural networks. We then modeled this as a discretely constrained optimization problem.
Borrowing the idea from the Alternating Direction Method of Multipliers (ADMM), we decoupled the continuous parameters from the discrete constraints of the network and cast the original hard problem into several subproblems. We proposed to solve these subproblems using extragradient and iterative quantization algorithms, which leads to considerably faster convergence compared to conventional optimization methods.
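As a rough illustration of the iterative quantization idea, the sketch below projects a weight vector onto a scaled ternary set by alternating between two simple updates: pick the nearest discrete code for a fixed scale, then pick the best least-squares scale for fixed codes. The ternary code set {-1, 0, +1}, the function name, and the initialization are assumptions made for this example. It shows only a projection step of this kind, not the full ADMM training loop.

```python
import math

def project_ternary(w, iters=10):
    """Approximate w by alpha * q with q[i] in {-1, 0, 1} (hypothetical helper).

    Alternates two steps that each reduce the squared error ||w - alpha*q||^2:
      1. fix alpha, round each w[i] / alpha to the nearest allowed code
      2. fix q, set alpha to the least-squares optimum (w . q) / (q . q)
    """
    alpha = sum(abs(v) for v in w) / len(w)  # assumed starting scale
    q = [0] * len(w)
    if alpha == 0:
        return 0.0, q  # all-zero weights need no quantization
    for _ in range(iters):
        # Round half up, then clip to the ternary range.
        q = [max(-1, min(1, math.floor(v / alpha + 0.5))) for v in w]
        qq = sum(c * c for c in q)
        if qq == 0:
            break  # every weight rounded to zero; keep the current scale
        alpha = sum(v * c for v, c in zip(w, q)) / qq
    return alpha, q

alpha, q = project_ternary([0.9, -1.1, 0.05, 0.0, 1.0, -0.02])
print(q)  # [1, -1, 0, 0, 1, 0]
```

Small weights collapse to zero, which yields sparsity, and the remaining weights share a single scale, which yields low precision.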
Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches for extremely low-bit neural networks.
As mentioned previously, low latency alone is not enough for most online services and usage scenarios, because the algorithm model changes frequently. The FPGA development cycle is long: a customized design usually takes a few weeks or months to finish. To address this challenge, we designed an instruction set architecture (ISA) and a compiler that reduce model upgrade time to only a few minutes.
The SW-HW co-development platform consists of the following items:
- Compiler: Model graph analysis and instruction generation.
- API/Driver: CPU-FPGA DMA, picture reshaping, weight compression.
- ISA Controller: Instruction decoding, task scheduling, multi-thread pipeline management.
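The compiler's role (graph analysis followed by instruction generation) can be sketched in a few lines. Everything below is hypothetical: the `Layer` structure, the opcode names such as `LOAD_WEIGHT`, and the tuple-based instruction format are invented for illustration and do not describe the real ISA. The point is only that a model change becomes a recompilation to a new instruction stream rather than a new FPGA design.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    name: str
    op: str                      # e.g. "conv", "bn", "relu" (assumed op names)
    inputs: list = field(default_factory=list)

def topo_order(layers):
    """Graph analysis: order layers so each one follows its producers.

    Assumes the graph is acyclic; names without a producing layer
    (such as the input image) are treated as external inputs.
    """
    by_name = {layer.name: layer for layer in layers}
    visited, order = set(), []

    def visit(name):
        if name in visited or name not in by_name:
            return
        visited.add(name)
        for dep in by_name[name].inputs:
            visit(dep)
        order.append(by_name[name])

    for layer in layers:
        visit(layer.name)
    return order

def compile_graph(layers):
    """Instruction generation: emit one made-up instruction per layer."""
    program = []
    for layer in topo_order(layers):
        if layer.op == "conv":
            program.append(("LOAD_WEIGHT", layer.name))  # hypothetical opcode
        program.append((layer.op.upper(), layer.name, tuple(layer.inputs)))
    return program

model = [
    Layer("relu1", "relu", ["bn1"]),
    Layer("conv1", "conv", ["image"]),
    Layer("bn1", "bn", ["conv1"]),
]
program = compile_graph(model)
for instruction in program:
    print(instruction)
```

The layers are deliberately listed out of order to show that the graph analysis step, not the author of the model description, determines execution order.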
The DLP was implemented on an Alibaba-designed FPGA card, which has PCIe and DDR4 memory. Combined with this FPGA card, the DLP can benefit application scenarios such as online image search on Alibaba.
FPGA test results with ResNet18 show that our design achieved ultra-low latency while maintaining very high performance with less than 70 W of chip power.
Published at DZone with permission of Leona Zhang. See the original article here.