A Hardware and Software Architecture for Data-Centric Parallel Computing

Principal Investigator Daniel Sanchez

Project Website http://www.nsf.gov/awardsearch/showAward?AWD_ID=1452994&HistoricalAwards=false

Project Start Date February 2015

Project End Date  January 2020

Energy efficiency is the key challenge facing computer systems. To improve performance under a limited energy budget, systems are becoming increasingly parallel, featuring many smaller and simpler cores, and heterogeneous, featuring cores specialized for certain tasks. Even with these improvements, two critical challenges remain. First, without reducing data movement, memory accesses and communication will dominate energy consumption. Thus, limiting data movement must become a primary design objective. Second, these systems will be highly complex, and will need powerful abstractions to shield programmers from this complexity. Current systems are designed in a computation-centric way that is a poor match for these challenges. Memory hierarchies are hardware-managed and opaque to software, which needlessly increases data movement; and runtimes lack the proper hardware mechanisms and software policies to manage heterogeneous resources efficiently.

This research project takes a holistic approach to addressing these challenges, by co-designing an architecture and runtime system that efficiently run dynamic parallel applications on systems with heterogeneous cores and memories. Redesigning hardware to be directly exploited by a dynamic runtime enables (a) many more opportunities to reduce data movement, (b) better usage of heterogeneous resources, and (c) much faster adaptation to changing application needs and available resources. Three key components underlie this design.

First, a scalable memory system incorporates combinations of heterogeneous memory technologies to improve efficiency, and exposes them to software, which can divide these physical memories into many virtual cache and memory hierarchies to finely control data placement. Second, specialized programmable engines orchestrate communication among cores, accelerate intensive runtime functions such as load balancing, and monitor how tasks use hardware resources to guide runtime decisions. Third, a hardware-accelerated runtime leverages this novel architectural support to place data and computation to minimize data movement, use the most suitable core for each task, and quickly respond to changing application needs. This runtime targets a high-level programming model that lets programmers express fine-grained and irregular task, data, and pipeline parallelism. These techniques build on an analytical design approach that makes hardware easy to understand and predict, and enables runtimes to navigate multi-dimensional tradeoffs efficiently.

If successful, this project will make heterogeneous systems more efficient, more broadly applicable, and easier to program. It will especially benefit applications with dynamic and fine-grained parallelism, advancing key emerging domains where these workloads are pervasive, such as graph analytics and online data-intensive services. In addition, the infrastructure developed as part of this project will be publicly released, enabling others to build on the results of this work.