Entry Date: October 21, 2013

Scalable Memory Hierarchies with Fine-Grained QoS Guarantees

Principal Investigator: Daniel Sanchez Martin
Project Start Date: August 2013
Project End Date: July 2017

Multicore chips are now mainstream, and increasing the number of cores per chip has become the primary way to improve performance. Current multicores rely on sophisticated cache hierarchies to mitigate the high latency, limited bandwidth, and high energy of main memory accesses, which often limit system performance. These on-chip caches consume more than half of chip area, and most of this cache space is shared among all cores. Sharing this capacity has major advantages, such as improving space utilization and accelerating core-to-core communication, but it poses two fundamental problems. First, with more cores, cache accesses take longer and consume more energy, severely limiting scalability. Second, concurrently executing applications contend for the shared cache capacity, which can cause unpredictable performance degradation. The goal of this project is to redesign the cache hierarchy to make it both highly scalable and capable of providing strict isolation among competing applications, enabling end-to-end performance guarantees. If successful, this work will improve the performance and energy efficiency of future processors, enabling systems with larger numbers of cores than previously possible. Moreover, these systems will eliminate interference and enforce quality-of-service guarantees among competing applications, even when those applications are latency-critical. This will enable much higher utilization of shared computing infrastructure (such as cloud computing servers), potentially saving billions of dollars in IT infrastructure and energy costs.

To achieve the dual goals of high scalability and quality-of-service (QoS) guarantees, this project proposes an integrated hardware-software approach: hardware exposes a small, general set of mechanisms to control cache allocations, and software uses these mechanisms to implement both partitioning and non-uniform access policies efficiently. At the hardware level, a novel cache organization provides thousands of fine-grained, spatially configurable partitions, implements lightweight monitoring and reconfiguration mechanisms to guide software policies, and supports scalable full-system cache coherence cheaply. At the software level, a system-level runtime leverages this hardware to implement dynamic data classification, placement, migration, and replication, maximizing system performance and efficiency while enforcing the strict QoS guarantees of latency-critical workloads, transparently to applications. Combined with existing bandwidth partitioning approaches, these techniques will enforce full-system QoS guarantees by controlling all on-chip shared resources (caches, the on-chip network, and memory controllers). In addition, the infrastructure and benchmarks developed as part of this project will be publicly released, allowing other researchers to build on this work and enabling course projects and other educational activities in large-scale parallel computer architecture, both at MIT and elsewhere.
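
To make the hardware-software split concrete, the following is a minimal C++ sketch of how a system-level runtime might drive such partitioning hardware. The interface and names here (PartitionHW, rebalance, per-partition miss counters) are illustrative assumptions, not the project's actual design; the policy shown simply reserves capacity for one latency-critical partition (the QoS guarantee) and divides the remaining capacity among best-effort partitions in proportion to their measured miss rates.

    // Hypothetical hardware/software contract for fine-grained cache
    // partitioning. All names are illustrative; the real mechanisms
    // developed in this project may differ substantially.
    #include <cstdint>
    #include <vector>
    #include <algorithm>

    // Assumed hardware mechanism: per-partition capacity allocation
    // (set by software) plus a lightweight hardware miss counter
    // (read and reset by software each monitoring interval).
    struct PartitionHW {
        uint64_t capacity_lines;  // capacity assigned by software
        uint64_t misses;          // monitored by hardware counters
    };

    // Software policy: enforce a fixed reservation for the
    // latency-critical partition, then split the remaining capacity
    // among best-effort partitions in proportion to their misses.
    void rebalance(std::vector<PartitionHW>& parts,
                   size_t latency_critical_idx,
                   uint64_t total_lines,
                   uint64_t reserved_lines) {
        // The QoS guarantee: the reservation is enforced first and
        // never eroded by best-effort demand.
        parts[latency_critical_idx].capacity_lines = reserved_lines;
        uint64_t remaining = total_lines - reserved_lines;

        // Sum the misses observed by best-effort partitions.
        uint64_t total_misses = 0;
        for (size_t i = 0; i < parts.size(); ++i)
            if (i != latency_critical_idx) total_misses += parts[i].misses;

        // Proportional share; equal split if no misses were observed.
        for (size_t i = 0; i < parts.size(); ++i) {
            if (i == latency_critical_idx) continue;
            uint64_t share = (total_misses == 0)
                ? remaining / (parts.size() - 1)
                : remaining * parts[i].misses / total_misses;
            parts[i].capacity_lines = std::max<uint64_t>(share, 1);
            parts[i].misses = 0;  // reset monitor for the next interval
        }
    }

In a real system, the runtime would invoke such a routine periodically (e.g., every few milliseconds), using the hardware monitors to estimate each workload's benefit from additional capacity rather than raw miss counts alone.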