Principal Investigator: Daniel Sanchez
Project Website: http://qcri.csail.mit.edu/node/24
Current shared computing platforms, from small clusters to large datacenters, suffer from low utilization, wasting billions of dollars in energy and infrastructure every year. Low utilization stems from a disconnect between layers of the hardware and software stack. The goal of this proposal is to investigate and develop integrated intra- and inter-node resource management techniques that provide both near-peak utilization and guaranteed high performance in shared environments.
To this end, this project consists of three main thrusts:
(1) Elastic multicore systems, which combine recent hardware support for fast resource management with a novel software runtime to make hardware adaptation work for, not against, performance guarantees. Elastic multicores will use different hardware resources (such as cores, caches, and power) to achieve a given performance target as efficiently as possible, and safely share resources among guaranteed-performance and best-effort applications.
(2) Novel solutions to enable collaborative multi-tenancy, where resource-intensive workloads are co-scheduled and placed using fine-grained, automatically collected resource usage profiles that account for aspects such as cache and memory bandwidth sharing.
(3) A shared system prototype that enables QF computing users to aggressively colocate applications on shared many-core nodes. The system will guarantee the latency requirements of performance-critical tasks (such as Al Jazeera video processing) while achieving high system utilization through intelligent placement of batch tasks such as HPC and data analytics.
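
As a rough illustration of the profile-driven placement described in thrusts (2) and (3), the sketch below places latency-critical tasks first and then fits batch tasks into the remaining cache and memory-bandwidth headroom. It is a minimal sketch under stated assumptions, not the project's actual scheduler: the names Task, Node, and place are hypothetical, resource demands are expressed as integer percentages of a node's capacity, and the best-fit heuristic is a stand-in chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical usage profile: demand as a percentage of one node's capacity.
    name: str
    cache_pct: int
    mem_bw_pct: int
    latency_critical: bool = False


@dataclass
class Node:
    # Remaining headroom on a node, in percent of total capacity.
    name: str
    cache_pct: int = 100
    mem_bw_pct: int = 100
    tasks: list = field(default_factory=list)

    def fits(self, task):
        return (task.cache_pct <= self.cache_pct
                and task.mem_bw_pct <= self.mem_bw_pct)

    def assign(self, task):
        self.cache_pct -= task.cache_pct
        self.mem_bw_pct -= task.mem_bw_pct
        self.tasks.append(task.name)


def place(tasks, nodes):
    """Greedy best-fit placement (an assumed heuristic).

    Latency-critical tasks are placed before batch tasks so their
    resources are reserved first. Each task goes to the fitting node
    that would be left with the least combined headroom. Returns the
    names of tasks that could not be placed on any node.
    """
    unplaced = []
    # Stable sort: critical tasks first, original order otherwise.
    ordered = sorted(tasks, key=lambda t: not t.latency_critical)
    for task in ordered:
        candidates = [n for n in nodes if n.fits(task)]
        if not candidates:
            unplaced.append(task.name)
            continue
        best = min(
            candidates,
            key=lambda n: (n.cache_pct - task.cache_pct)
            + (n.mem_bw_pct - task.mem_bw_pct),
        )
        best.assign(task)
    return unplaced
```

Calling place with a list of Task profiles and a list of Node objects mutates the nodes in place and returns the tasks that did not fit. A real system would also need to measure profiles online and react when a co-located batch task starts to threaten a latency target, which this sketch does not attempt.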