Northstar: Making Data Science More Interactive

Project Website http://dsail.csail.mit.edu/index.php/vizdom/

Unfolding the potential of Big Data for a broader range of users requires a paradigm shift in the algorithms and tools used to analyze data. Exploring complex data sets needs more than a simple question-and-response interface. Ideally, the user and the system engage in a "conversation", each party contributing what it does best. The user can contribute judgement and direction, while the machine can contribute its ability to process massive amounts of data, and perhaps even predict what the user might require next. Even with sophisticated visualizations, digesting and interpreting large, complex data sets often exceeds human capabilities. Machine learning (ML) and statistical techniques can help in these situations by providing tools that clean, filter and identify relevant data subsets. Unfortunately, support for ML is often added as an afterthought: the techniques are buried in black boxes and executed in an all-or-nothing manner. Results can often take hours to compute, which is not acceptable for interactive data exploration. Moreover, users want to see the result as it evolves. They want to interrupt the computation and change the parameters, the features or even the whole pipeline. In other words, data scientists today still use the same text-style batch interfaces from the 80s, whereas we should explore data the way many movies envision it, from James Bond to Minority Report, or as outlined in the Microsoft User Interface Vision. While some work exists on creating novel interfaces, like the one from Minority Report, it often ignores the system and ML aspects and is not really usable in practice. The systems community, on the other hand, tends to ignore the user interface aspect.

As part of the Northstar project we envision a completely new approach to conducting exploratory analytics. We speculate that soon many conference rooms will be equipped with an interactive whiteboard, like the Microsoft Surface Hub, and that we can use such whiteboards to avoid the week-long back-and-forth interactions between data scientists and domain experts. Instead, we believe that the two can work together during a single meeting, using such an interactive whiteboard to visualize, transform and analyze even the most complex data on the spot. This setting will undoubtedly help users quickly arrive at an initial solution, which can be further refined offline. Our hypothesis is that we can make data exploration much easier for lay users while automatically protecting them from many common errors. Furthermore, we hypothesize that we can develop an interactive data exploration system that provides meaningful results in sub-second time, even for complex ML pipelines over very large data sets. These techniques will not only make ML more accessible to a broader range of users, but also ultimately enable more discoveries than any batch-driven approach.