Entry Date: January 22, 2019

Bandwidth-Efficient Deep Learning: Algorithm and Hardware Co-Design

Principal Investigator Anantha Chandrakasan


In the post-ImageNet era, computer vision and machine learning researchers are solving more complicated Artificial Intelligence (AI) problems using larger data sets, driving the demand for more computation. However, we are in the post-Moore’s Law world, where the amount of computation per unit cost and power is no longer increasing at its historic rate. This mismatch between the supply of and demand for computation highlights the need for co-designing efficient machine learning algorithms and domain-specific hardware architectures. By performing optimizations across the full stack, from application through hardware, we improved the efficiency of deep learning, achieving smaller model sizes, higher prediction accuracy, faster prediction speed, and lower power consumption.

The approach starts by changing the algorithm, using “Deep Compression,” which significantly reduces the number of parameters and the computation requirements of deep learning models through pruning, trained quantization, and variable-length coding. “Deep Compression” can reduce the model size by 18× to 49× without hurting the prediction accuracy. We also discovered that pruning and the sparsity constraint apply not only to model compression but also to regularization, and we proposed dense-sparse-dense training (DSD), which can improve the prediction accuracy for a wide range of machine learning tasks.
To efficiently implement “Deep Compression” in hardware, we developed EIE, the “Efficient Inference Engine,” a domain-specific hardware accelerator that performs inference directly on the compressed model, significantly saving memory bandwidth. By taking advantage of the compressed model and handling its irregular computation pattern efficiently, EIE improves the speed by 13× and the energy efficiency by 3,400× over a GPU.
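
To make the compressed-inference idea concrete, the following sketch shows, in software, the kind of computation EIE performs: the quantized sparse layer is stored column-wise as row indices plus 4-bit codebook indices, and the matrix-vector product touches only nonzero weights while skipping zero activations. The storage layout and the names compress_csc and sparse_matvec are assumptions made for illustration; the real accelerator carries out this computation in parallel hardware with the compressed model held on-chip.

```python
import numpy as np

def compress_csc(W_quant, codebook):
    """Store a quantized sparse matrix column by column: per-column row indices
    plus small codebook indices in place of full-precision weights
    (an illustrative stand-in for a compressed storage format)."""
    col_ptr, row_idx, codes = [0], [], []
    for j in range(W_quant.shape[1]):
        rows = np.nonzero(W_quant[:, j])[0]
        row_idx.extend(rows)
        # Replace each stored weight by the index of its shared centroid.
        codes.extend(np.argmin(np.abs(W_quant[rows, j][:, None] - codebook[None, :]), axis=1))
        col_ptr.append(len(row_idx))
    return np.array(col_ptr), np.array(row_idx), np.array(codes, dtype=np.uint8)

def sparse_matvec(col_ptr, row_idx, codes, codebook, a, n_rows):
    """Compute y = W @ a while reading only nonzero weights and skipping zero
    activations, which is where the bandwidth and energy savings come from."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:                    # activation sparsity: skip the whole column
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += codebook[codes[p]] * aj   # decode the weight from the codebook
    return y

# Toy usage: a small sparse layer whose weights are drawn from a 16-entry codebook.
rng = np.random.default_rng(1)
codebook = np.linspace(-1.0, 1.0, 16)
W_quant = codebook[rng.integers(0, 16, size=(64, 64))] * (rng.random((64, 64)) < 0.1)
a = np.maximum(rng.normal(size=64), 0.0)  # ReLU-style activations: many exact zeros
col_ptr, row_idx, codes = compress_csc(W_quant, codebook)
y = sparse_matvec(col_ptr, row_idx, codes, codebook, a, n_rows=64)
assert np.allclose(y, W_quant @ a)
```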