Song Han

Robert J Shillman (1974) Career Development Assistant Professor in Electrical Engineering and Computer Science

Democratizing artificial intelligence with deep compression

Dr. Song Han designs innovative algorithms and hardware systems based on his deep compression technique for machine learning.

By: Daniel de Wolff

Prior to joining the Department of Electrical Engineering and Computer Science at MIT, Song Han received his PhD in electrical engineering from Stanford University, where he and his advisor, computer architecture trailblazer William Dally, laid the foundation for his current work as principal investigator at MIT HAN Lab. The technique pioneered by Han is called deep compression; it compresses deep neural networks by 10X to 50X, making them more efficient without losing accuracy.

First proposed in 1944, neural networks are an approach to machine learning intended to mimic the activity of the human brain. Composed of groups of interconnected input and output nodes (or artificial neurons) connected to other nodes for the purpose of processing and communicating data, neural net systems function as a framework for machine learning algorithms and are particularly useful in the realms of image recognition, speech recognition, and natural language processing. In neural networks, raw data is fed into an input node then mapped to an output node that classifies the data.

While the interest in a neural net approach to artificial intelligence has waxed and waned since its inception, the idea has experienced a resurgence in conjunction with the advent of increased computing power. It has also evolved as a concept to include deep neural nets, which are essentially more sophisticated, powerful networks composed of many layers of overlapping clustered nodes, capable of sorting through vast quantities of unstructured data.

The technique is commonly referred to as deep learning, and it currently powers our most significant artificial intelligence-inspired advancements (e.g., self-driving cars, cancer detection, and FaceID). Deep neural nets are a powerful tool, but they consume a tremendous amount of computation and memory, requiring high-performance hardware with vast stores of energy to process all of the data necessary to train deep learning algorithms. That makes them difficult to deploy on embedded systems with limited hardware resources, such as mobile phones and the Raspberry Pi.

To address the storage limitations of deep neural nets, Han’s deep compression uses a three-stage pipeline composed of pruning, quantization, and Huffman coding. Pruning trims nodal connections, reducing their number by 9X to 13X while maintaining accuracy. Quantization reduces the number of bits needed to represent each connection, further shrinking model size and memory footprint and bringing total compression to 27X to 37X. Huffman coding, a lossless encoding that allows the original data to be perfectly reconstructed from the compressed data, then raises total compression to 35X to 49X.
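The three stages can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Han's implementation: the weights are random numbers rather than a trained network, quantization uses evenly spaced shared values as a simple stand-in, pruned connections are marked with a placeholder index instead of a sparse storage format, and no retraining takes place. All function names are hypothetical.

```python
import heapq
import random
from collections import Counter

def prune(weights, keep_fraction):
    """Magnitude pruning: zero out all but the largest-magnitude weights."""
    k = max(1, int(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize(weights, bits):
    """Map each surviving weight to one of 2**bits shared values.

    Assumption: evenly spaced values, chosen here only for simplicity.
    """
    nonzero = [w for w in weights if w != 0.0]
    lo, hi = min(nonzero), max(nonzero)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1) or 1.0
    codebook = [lo + i * step for i in range(levels)]
    # Index -1 is a placeholder for a pruned connection.
    indices = [round((w - lo) / step) if w != 0.0 else -1 for w in weights]
    return indices, codebook

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_bits(symbols):
    """Total bits needed to encode the symbol stream with its Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # made-up weights
pruned = prune(weights, keep_fraction=0.1)
indices, codebook = quantize(pruned, bits=4)

dense_bits = 32 * len(weights)                            # 32-bit floats
compressed = huffman_bits(indices) + 32 * len(codebook)   # indices + codebook
print(f"dense: {dense_bits} bits, compressed: {compressed} bits")
```

The ratio this toy prints says nothing about real networks. It only shows how the stages stack: pruning makes most entries identical, quantization shrinks the alphabet of remaining values, and Huffman coding then spends the fewest bits on the most frequent symbols.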

“The goal of my research is to make AI more efficient, not only in large scale data centers, but also on edge devices, phones, sensors, microcontrollers, so every hardware device, even small devices, can have AI inside.”

Imagine a world where every device is embedded with AI. “AI everywhere will change our lives,” says Han. “A ‘dumb’ camera becomes a ‘smart’ camera with artificial intelligence. Previously it could only shoot videos and post surveillance. After adding an AI chip in my lab, we can do pedestrian detection, car detection, and gesture recognition.”

What’s more, Han’s deep compression could play a major role in mobile AI as internet of things (IoT) devices become more prevalent. IoT manufacturers are realizing the benefits of doing advanced processing and analytics on the device as opposed to on the cloud. This on-device approach is called edge computing, and its decentralized, distributed method of operation could be considered the counterpart to cloud computing. It reduces latency for critical applications, lowers dependence on the cloud, cuts connectivity costs, improves privacy, and helps manage the massive amounts of data being generated by the IoT.

Practical applications for IoT devices, edge computing, and by extension deep compression and hardware acceleration abound. On-device AI eliminates the need for device-to-cloud data dispersal, which means, among other things, that hospitals can keep their patients’ data secure. New retail also looks to benefit from the advances of edge computing in a number of ways, increasing personalization for customers while maintaining privacy (think facial recognition at check-out kiosks) and helping retailers understand availability of products on shelves with real-time sensors.

Han mentions self-driving vehicles as an area set to benefit from his research. “With self-driving cars we are limited by the small form factor of the processor, the heat generated by the processor, not to mention the speed.” The frame rate, which is essential to an autonomous vehicle’s ability to identify and avoid obstacles, needs to be more than 30 frames per second in real time. “And you have to process multiple sensors to do sensor fusion at the same time. That’s a lot of computation! Efficient hardware and methods for deep learning really help those applications. And bringing AI to the edge makes it possible for us to process data locally, without having to rely on a data center.”

While his PhD work focused on compressing deep neural networks by relying on human heuristics, his current research at MIT addresses the design automation of efficient, smaller neural network architectures and hardware architectures. “Having an automated tool chain to do design space exploration opens up many doors in this space. From a commercial perspective, if you have ten customers you can of course have ten engineers serving each customer, but when the business grows and there are thousands of customers, you want design automation rather than relying on thousands of engineers. In HAN Lab, we’re working on hardware-centric AutoML to optimize the algorithm and the hardware in an automated manner. Our recent work on ProxylessNAS automatically searched a model that’s 1.8x faster than the human-engineered model MobileNet-V2, the current industry standard, while reducing the search cost by 240x compared to previous work. ProxylessNAS will appear at ICLR 2019.”

For the uninitiated: ICLR is the International Conference on Learning Representations, a premier gathering of professionals dedicated to the advancement of deep learning (i.e., representation learning). In ProxylessNAS, NAS stands for neural architecture search, whose goal is to automatically search for a good neural network architecture rather than rely on a human engineer to design it. MobileNet-V2 is an example of a human-designed model, widely used by Google.

“ProxylessNAS” means Han and his team directly search for a model that’s specialized for the target task and target hardware. “We don’t need a proxy task and we don’t use one model for all hardware. Instead, we can specialize the model for target hardware. Conventional NAS is too expensive at 48,000 GPU hours. We need only 200 GPU hours, that’s why we can do such specialization and be proxyless.”
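As a quick check on the figures Han quotes, the ratio of the two GPU-hour budgets appears consistent with the 240x search-cost reduction mentioned earlier. The variable names below are illustrative only.

```python
# GPU-hour figures as quoted by Han in the text above.
conventional_gpu_hours = 48_000
proxyless_gpu_hours = 200

reduction = conventional_gpu_hours / proxyless_gpu_hours
print(f"search cost reduction: {reduction:.0f}x")  # prints "search cost reduction: 240x"
```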

We’re living in a post-Moore’s Law era, where we can no longer count on exponential growth in computing power and corresponding decreases in relative cost. “The end of Moore’s Law means we no longer get a free lunch from technologies scaling. At the same time, we are in a post-ImageNet era,” says Han. “Machine learning engineers are dealing with more data and more complicated problems that require more computation. Our goal at HAN Lab is to design domain-specific architectures for AI and also to work on efficient algorithms that are hardware friendly; to work on joint co-designing of efficient algorithms and hardware.”

“I have two main goals,” says Han. “First, I want to make AI more efficient so that our lives will continue to benefit from AI. Second, I want to cultivate more PhD students and qualified researchers capable of doing cutting-edge AI research at MIT.”