Entry Date:
May 14, 2019

The Impact of the Unknown Unknowns


As part of the Quantifying the Uncertainty in Data Exploration (QUDE) project, we have started developing techniques to automatically quantify the different types of error within data exploration pipelines. Some types of uncertainty are obvious; they are usually visualized using error bars and computed through sampling techniques. However, many other types of error exist and are often ignored. For example, it is common practice for data scientists to acquire and integrate disparate data sources to achieve higher-quality results. Yet even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? (2) What is the impact of any unknown (i.e., unobserved) data on query results? As a first step within the QUDE project, we are developing and analyzing techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Surprisingly, our statistical techniques are parameter-free and do not assume prior knowledge about the distribution. Through theoretical analysis and a series of experiments, we show that estimating the impact of unknown unknowns is invaluable for assessing the results of aggregate queries over integrated data sources. For this project, we work together with the Center of Evidence-Based Medicine (EBM) at Brown University.
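
To make the overlap idea concrete, here is a minimal sketch, not the project's actual estimator. It assumes each data source is a dictionary mapping an item key to its numeric value, and it swaps in the standard bias-corrected Chao1 species-richness estimator to guess how many distinct items exist in total, based on how many items appear in exactly one source and in exactly two sources. It then corrects a SUM query by assuming the unseen items have the same mean value as the observed ones, which is a simplifying assumption made here for illustration. The function names `chao1_estimate` and `estimate_sum` are hypothetical.

```python
from collections import Counter


def chao1_estimate(sources):
    """Estimate the total number of distinct items (bias-corrected Chao1).

    `sources` is a list of dicts mapping item key -> value. An item seen in
    only one source hints that further items were seen in none.
    """
    # Count in how many sources each item key appears.
    counts = Counter()
    for source in sources:
        for key in set(source):
            counts[key] += 1

    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # seen in exactly one source
    f2 = sum(1 for c in counts.values() if c == 2)  # seen in exactly two sources
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


def estimate_sum(sources):
    """Correct SUM over the integrated data for the estimated unseen items.

    Assumption (illustrative only): unseen items have the same mean value
    as the observed ones.
    """
    # Integrate the sources; duplicates collapse onto a single key.
    values = {}
    for source in sources:
        values.update(source)
    if not values:
        return 0.0

    observed_sum = sum(values.values())
    missing = max(chao1_estimate(sources) - len(values), 0.0)
    return observed_sum + missing * (observed_sum / len(values))
```

With many singletons and few items shared across sources, the estimated number of missing items grows, and so does the correction to the aggregate; when every item appears in several sources, the correction shrinks toward zero.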