Principal Investigator John Williams
Co-investigator Abel Sanchez
Project Website http://geospatial.mit.edu/globalITsim.html
The simulator takes as input the hardware configuration deployed in each data center, the network topology (including bandwidth and latency), the application profile, the resources allocated by individual client requests, and details of background processes. It then outputs resource usage (CPU, memory, and network) and user response times globally.
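As a concrete illustration, the sketch below shows one way the inputs and outputs listed above might be organized as data structures. All class and field names here are assumptions made for this example, not the simulator's actual schema or API.

```python
# Hypothetical sketch of simulator inputs/outputs; names and fields are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ServerSpec:
    cores: int          # physical CPU cores
    memory_gb: int      # installed RAM
    nic_gbps: float     # network interface capacity

@dataclass
class LinkSpec:
    bandwidth_gbps: float
    latency_ms: float

@dataclass
class DataCenter:
    name: str
    servers: List[ServerSpec]
    background_processes: List[str] = field(default_factory=list)

@dataclass
class SimulationInput:
    datacenters: List[DataCenter]
    # Network topology as links between named data centers.
    links: Dict[Tuple[str, str], LinkSpec]
    # Application profile: resources consumed by an individual client request.
    request_cpu_ms: float
    request_memory_mb: float
    request_bytes_on_wire: int

@dataclass
class SimulationOutput:
    cpu_utilization: Dict[str, float]                  # per data center, 0..1
    memory_utilization: Dict[str, float]               # per data center, 0..1
    network_utilization: Dict[Tuple[str, str], float]  # per link, 0..1
    response_time_ms: Dict[str, float]                 # per client region
```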
Simulator goal -- The goal is to provide datacenter operators with a tool for understanding and predicting the consequences of deploying new network topologies, hardware configurations, or software applications in a global data infrastructure, without affecting the live service.
Individual elements -- This simulation of large-scale IT infrastructures not only reproduces the behavior of data centers at a macroscopic scale, but also allows operators to navigate down to the detail of individual elements, such as processors or network links.
Multicore -- The simulator is a Multi-Agent Discrete Event Simulator implemented with multicore algorithms for speed. It was constructed using a multi-layered approach and optimized for multicore scalability.
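The following is a minimal sketch of a heap-based discrete event engine of the kind described above, with independent partitions (for instance, one per data center) farmed out to worker processes. The partitioning scheme, the fixed horizon, and all names are illustrative assumptions; the simulator's actual multicore algorithms are not reproduced here.

```python
# Minimal discrete event engine sketch; illustrative only, not the project's implementation.
import heapq
from dataclasses import dataclass, field
from multiprocessing import Pool
from typing import Callable, List, Tuple

@dataclass(order=True)
class Event:
    time: float
    action: Callable[["Engine"], None] = field(compare=False)

class Engine:
    """Single-threaded discrete event loop for one partition (e.g. one data center)."""

    def __init__(self) -> None:
        self.clock = 0.0
        self._queue: List[Event] = []

    def schedule(self, delay: float, action: Callable[["Engine"], None]) -> None:
        # Queue an event at an absolute time derived from the current clock.
        heapq.heappush(self._queue, Event(self.clock + delay, action))

    def run_until(self, horizon: float) -> float:
        # Pop events in timestamp order until the queue is empty or the horizon is reached.
        while self._queue and self._queue[0].time <= horizon:
            event = heapq.heappop(self._queue)
            self.clock = event.time
            event.action(self)
        return self.clock

def handle_request(engine: Engine) -> None:
    # Example agent behavior: a request whose completion is scheduled 5 ms later.
    engine.schedule(5.0, lambda e: None)

def simulate_partition(seed_events: List[Tuple[float, Callable[[Engine], None]]]) -> float:
    # Each partition runs independently up to a fixed horizon (a simplification;
    # a real parallel simulator must also synchronize cross-partition messages).
    engine = Engine()
    for delay, action in seed_events:
        engine.schedule(delay, action)
    return engine.run_until(horizon=100.0)

if __name__ == "__main__":
    # Two independent partitions simulated in parallel worker processes.
    partitions = [[(1.0, handle_request)], [(2.0, handle_request)]]
    with Pool(processes=2) as pool:
        print(pool.map(simulate_partition, partitions))
```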
IT infrastructures in global corporations are appropriately compared with nervous systems, in which body parts (interconnected datacenters) exchange signals (request-responses) in order to coordinate actions (data visualization and manipulation). Seemingly innocuous perturbations in the operation of the system or in the elements composing the infrastructure can lead to catastrophic consequences. Downtime prevents clients from reaching the latest versions of the data and from propagating their own contributions to other clients, potentially costing the affected organization millions of dollars. The imperative of guaranteeing the proper functioning of the system not only forces operators to pay particular attention to network outages, hot objects, and application defects, but also slows down the deployment of new capabilities, features, and equipment upgrades. Under these circumstances, decision cycles for such modifications can be extremely conservative, stretching over years and involving multiple authorities across departments of the organization. Frequently, the solutions adopted are years behind state-of-the-art technologies or lag leading research in the IT infrastructure field. In this study, the use of a large-scale data infrastructure simulator is proposed to evaluate the impact of "what if" scenarios on the performance, availability, and reliability of the system. The goal is to provide operators and designers with a tool for understanding and predicting the consequences of deploying new network topologies, hardware configurations, or software applications in a global data infrastructure, without affecting the service. The simulator was constructed using a multi-layered approach, providing granularity down to the individual server component and client action, and was validated against the data infrastructure of a Fortune 500 company.