The recent success of machine learning, with the help of emerging techniques, such as convolutional neural networks, have been rapidly changing the way many traditional signal processing problems are being solved, including vision processing, speech recognition, and other prediction and optimization problems. However, neural networks require a large number of weight parameters and processing power that are difficult to accommodate efficiently using a normal CPU architec- ture. This necessitates dedicated on-chip solutions.
A major challenge in recent on-chip neural network processors is reducing the energy consumed by memory accesses, as the cost for data operations becomes relatively cheaper than the cost for data movement in recently advanced processes. One approach is to simply reduce the amount of data movement by using compression schemes (i.e., reducing the bit-width of weights and activations).
Han, et al. develop a deep compression technique to non-uniformly quantize floating point weights to 4-bit values, without any loss of accuracy. This was further extended to quantizing to only 2-bit ternary weights. Another approach is to increase the memory capacity, for example with 3-D stacked memory, to reduce the required number of costly external DRAM accesses.
The proposed design takes full advantage of these compression schemes by directly integrating the decompression within the processing element. In addition, the design can be reconfigured to perform more general fixed-point computations with variable bit-widths. Combining this with a closely integrated memory chip through 3-D stacking makes it possible to run large networks with less data movement to and from the external DRAM, resulting in improved energy efficiency compared to other implementations.