DAS: A Self-Assembling Database System

Co-investigators Michael Stonebraker , Samuel Madden

Project Website http://dsail.csail.mit.edu/index.php/projects/

Modern data processing systems are designed to be general-purpose, in that they can handle a wide variety of different schemas, data types and data distributions, and provide efficient access to that data through use of optimizers and cost models. The general-purpose nature of current systems results in code bases that don't take advantage of the particular specification of a user’s application and data. However, recent results from our research members have shown how learned models over the user’s data distribution and workload pattern can significantly enhance and sometimes entirely replace index structures and scheduling algorithms.

We are just at the beginning of understanding the potential of using learned models to enhance the traditional algorithms and data structures that form the basis of modern DBMSs, such as B-Trees, Bloom filters, hash maps, sorting algorithms, join algorithms and query optimizers. The core idea is that (1) almost any algorithm and data structure used by a data management system can be improved if something is known about the data distribution and (2) we can synthesize very efficient new algorithms and data structures with embedded models. But first, fundamental research must be done to understand the implications of putting learned models at the core of the fundamental algorithms and data structures of a database system (e.g., the query optimizer).

What do models mean for the worst-case complexity? Is it possible to bound the worst-case complexity through certain models or hybrid approaches? How can we efficiently approximate the empirical cumulative distribution function (CDF) of a data set in the least number of instructions? How can we efficiently synthesize algorithms and data structures and how can we enhance models with traditional algorithm components? In addition, learned data structures may be more vulnerable against adversarial attacks, so we propose to address security concerns.

To answer such questions, we will build a Self-Assembling Database System prototype, called DAS, an open-source research platform for model-driven data processing engines. More concretely, we will investigate to what extent learned models can improve the performance of various components of a large-scale data processing system, such as:

(1) Query Optimization, where models can improve cardinality estimation and the cost model, or provide more efficient means to explore the optimization space.

(2) Data Access, where models can predict the location of a key in a database (i.e., indexing), help to compress the data, or optimize the storage format

(3) Query execution, where models can help find efficient query schedules or potentially speed up sorting or joins.

(4) Advanced Analytics, where models can be used to approximately answer queries or speed up learning tasks over the data.

This is a highly inter-disciplinary research project as it requires a deep understanding of (1) fundamental algorithms and data structures, (2) data-centric systems, (3) machine learning and (4) program synthesis/compilers.