AI on FPGAs
Following the seminal paper by Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, Deep Neural Networks (DNNs) have driven considerable progress in recent years in areas such as image classification, computer Go, handwriting recognition, natural language processing, financial services and speech recognition.
Diagram of a simplified AlexNet Convolutional Neural Network
Neural Networks rely heavily on massively parallel multiply-accumulate and other operations. Because of the intense compute power needed, GPUs rather than CPUs were the initial hardware of choice, thanks to their intrinsic parallelism and availability. The choice of GPUs over other engines has also been reinforced by the development of general-purpose GPU programming platforms such as CUDA and Machine Learning frameworks such as Caffe and TensorFlow.
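To make the multiply-accumulate workload concrete, the following is a minimal sketch of a single-channel convolution written as explicit loops. The function name `conv2d_single_channel` and the example data are illustrative assumptions, not taken from any particular framework. Each output value is an independent sum of products, which is why the work maps well onto parallel hardware.

```python
def conv2d_single_channel(image, kernel):
    """Stride-1, no-padding 2D convolution (cross-correlation, as used in CNNs).

    Hypothetical helper for illustration: returns the output feature map and
    the number of multiply-accumulate (MAC) operations performed.
    """
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1

    output = [[0] * out_w for _ in range(out_h)]
    mac_count = 0
    for y in range(out_h):          # every (y, x) output is independent of
        for x in range(out_w):      # the others, so these loops can run in parallel
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += image[y + i][x + j] * kernel[i][j]  # one MAC
                    mac_count += 1
            output[y][x] = acc
    return output, mac_count


# Illustrative 4x4 input and 3x3 kernel of ones (assumed example data).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]

feature_map, macs = conv2d_single_channel(image, kernel)
print(feature_map)  # [[54, 63], [90, 99]]
print(macs)         # 36 MACs: 2 * 2 outputs, 9 MACs each
```

Even this toy case needs 36 MACs for four outputs; real layers repeat the same pattern across many channels and filters.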
GPUs and CPUs operate in a memory-centric manner, which increases power usage and introduces latency when running neural network topologies. FPGAs, on the other hand, can use internal block RAM (BRAM), which has significantly higher bandwidth than external RAM and requires much less power.
GPUs are highly versatile and very capable, but they were not designed for machine learning and do not necessarily provide the optimum architecture for machine learning applications. The sheer size of GPUs compensates for this to some degree, but at the expense of power consumption. The architectural limitations of GPUs mean that efficiency in machine-learning applications is only achieved by “batching” large numbers of images for simultaneous processing. While this approach increases throughput, it also increases latency, which is incompatible with mission-critical real-time applications such as autonomous driving.
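The throughput-versus-latency trade-off of batching can be illustrated with a deliberately simple cost model. This is a sketch under stated assumptions, not a measurement of any real device: it assumes each batch pays a fixed overhead plus a constant per-image cost, and the function name `batch_metrics` and all timing figures are hypothetical.

```python
def batch_metrics(batch_size, fixed_overhead_ms, per_image_ms):
    """Toy model (assumption): batch time = fixed overhead + per-image cost * batch size.

    Returns (latency in ms until the batch completes, throughput in images/second).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch_latency_ms = fixed_overhead_ms + per_image_ms * batch_size
    throughput_ips = batch_size / batch_latency_ms * 1000.0
    return batch_latency_ms, throughput_ips


# Hypothetical timings chosen only to show the trend.
FIXED_OVERHEAD_MS = 8.0
PER_IMAGE_MS = 0.5

for batch_size in (1, 8, 64):
    latency_ms, throughput = batch_metrics(batch_size, FIXED_OVERHEAD_MS, PER_IMAGE_MS)
    print(f"batch={batch_size:3d}  latency={latency_ms:5.1f} ms  "
          f"throughput={throughput:7.1f} images/s")
```

In this model a larger batch amortises the fixed overhead, so throughput rises, but every image in the batch waits for the whole batch to finish, so latency rises too. The model also ignores the time spent waiting for a batch to fill, which would add further latency in a streaming application.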
Improved accuracy of neural networks has come at the expense of higher complexity. There is also a need to create both low-power embedded implementations and high-performance reconfigurable Data Centre compute engines. Recent research has shown that compute efficiency can be achieved by a variety of techniques, some of which cannot be exploited by GPUs.
ASICs or ASSPs
These trends have led to significant research into alternative compute engines, implemented on FPGAs or custom silicon devices (ASICs/ASSPs). Custom silicon devices are not optimal for neural network processing in general, as their architectures are fixed: they cannot be re-optimised for each desired workload or updated to adopt the benefits of the latest research into DNNs.
Published research has shown, for example, that an ASSP might be 86% efficient in implementing one type of CNN but only 14% efficient at implementing another CNN and less than 4% efficient at implementing LSTMs. FPGAs, on the other hand, can be optimised for each workload through hardware reconfiguration or software programming in a matter of milliseconds. FPGAs can also integrate additional system functions, reducing SWaP-C (size, weight, power and cost) at the system level.
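As a rough illustration of what those efficiency figures imply, the sketch below scales a peak compute rating by each quoted efficiency. The peak figure of 10 TOPS and the names `effective_throughput`, `CNN A` and `CNN B` are hypothetical, chosen only for the arithmetic. The efficiencies are the ones quoted above, with 4% used as an upper bound for the LSTM case.

```python
# Efficiency figures quoted in the text; 0.04 is an upper bound ("less than 4%").
ASSP_EFFICIENCY = {
    "CNN A": 0.86,
    "CNN B": 0.14,
    "LSTM": 0.04,
}

# Hypothetical peak rating, assumed purely for illustration.
HYPOTHETICAL_PEAK_TOPS = 10.0


def effective_throughput(peak_tops, efficiency):
    """Usable compute = peak compute * fraction of compute units kept busy."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return peak_tops * efficiency


for workload, efficiency in ASSP_EFFICIENCY.items():
    usable = effective_throughput(HYPOTHETICAL_PEAK_TOPS, efficiency)
    print(f"{workload}: at most {usable:.1f} of {HYPOTHETICAL_PEAK_TOPS:.1f} TOPS usable")
```

Under these assumptions, the same fixed device delivers more than six times the usable compute on the first CNN than on the second, which is the gap that per-workload reconfiguration aims to close.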
While ASICs deliver the optimum performance, unit cost and power consumption for one specific application, their extremely high NRE costs and long time to market make them suitable only for the very highest-volume applications, such as smartphones.
Additionally, and very importantly, whereas it might take a number of years to adopt new architectures or optimisation techniques into an ASIC/ASSP and get production parts into service, an FPGA can be redesigned in a matter of weeks and devices already deployed in the field can be reprogrammed in minutes.
There are significant advantages to using FPGA technology instead of other technologies:
- Very high performance per watt and low latency
- The network architecture can be optimised for the workload, delivering the optimum performance, cost and power
- The FPGA can be reconfigured to adopt the benefits of future research and rapidly deliver improvements in architecture and optimisation techniques
- Fast time to market, which is of great value in a rapidly evolving field
- The Omnitek DPU can be integrated with other processors, video/vision functions and connectivity blocks to create a complete system on a chip
As machine learning evolves and different neural networks appear, only FPGA-based technology can be reconfigured to accommodate the newest approaches to training and inference engines.
For research into alternative platforms, see the Omnitek DPU White Paper, “The Use of FPGAs for Deep Learning Acceleration”.