Data Systems Group at MIT

Northstar - Making Data Science More Interactive

Unleashing the potential of Big Data for a broader range of users requires a paradigm shift in the algorithms and tools used to analyze data. Exploring complex datasets needs more than a simple question-and-response interface. Ideally, the user and the system would engage in a “conversation,” each party contributing what it does best. The user can contribute judgment and direction, while the machine can contribute its ability to process massive amounts of data, perhaps even predicting what the user might require next. However, even with sophisticated visualizations, digesting and interpreting large, complex datasets often exceeds human capabilities.

Machine learning (ML) and statistical techniques can help in these situations by providing tools that clean, filter and identify relevant data subsets. Unfortunately, support for ML is all too often added as an afterthought: the techniques are buried in black boxes and executed in an all-or-nothing manner. Results can often take hours to compute, which is unacceptable for interactive data exploration. Moreover, users want to see the result as it evolves. They want to interrupt, change the parameters, features or even the whole pipeline. Meanwhile, data scientists are still using text-style batch interfaces from the 80s.

Northstar includes four main components:

  1. Vizdom: a novel visual data exploration environment specifically designed for pen and touch interfaces, such as the Microsoft Surface Hub.
  2. IDEA: an intelligent cache and streaming approximation engine, which enables users to analyze data and create ML pipelines with immediate feedback over any type of data source and independent of the data size.
  3. QUDE: a component that monitors every user interaction and tries to warn about common mistakes and problems.
  4. Alpine Meadow: a “query” optimizer for machine learning that allows users to declaratively indicate what they want (e.g., “predict label X”) while the system automatically figures out the best ML pipeline (i.e., plan) to achieve that goal.
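
To make the idea of immediate feedback independent of data size more concrete, the following is a minimal sketch of progressive approximation: an aggregate is estimated from a growing random sample, and each intermediate estimate is reported together with an approximate error bound so the user can stop early. This is an illustration only, not IDEA's actual implementation; the names `Estimate` and `progressive_mean` and the parameter `report_every` are hypothetical, and the 95% interval assumes the rows arrive in random order.

```python
import math
import random
from typing import Iterable, Iterator, NamedTuple


class Estimate(NamedTuple):
    rows_seen: int
    mean: float
    half_width: float  # approximate 95% confidence half-width


def progressive_mean(values: Iterable[float], report_every: int = 1000) -> Iterator[Estimate]:
    """Yield a running estimate of the mean so the caller can stop at any time.

    Hypothetical helper for illustration; assumes `values` arrive in random order.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield Estimate(n, mean, 1.96 * math.sqrt(variance / n))
    # Report the final state if the last rows did not fill a whole reporting interval.
    if n > 0 and n % report_every != 0:
        variance = m2 / (n - 1) if n > 1 else 0.0
        yield Estimate(n, mean, 1.96 * math.sqrt(variance / n))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
    rng.shuffle(data)  # random order, so every prefix is a random sample
    for est in progressive_mean(data, report_every=10_000):
        print(f"{est.rows_seen:>7} rows: mean = {est.mean:.2f} +/- {est.half_width:.2f}")
        if est.half_width < 0.1:
            break  # the estimate is good enough; skip the remaining rows
```

Because the function is a generator, the consumer decides when the answer is precise enough, which mirrors the interrupt-and-refine style of interaction described above.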

Citations

Tim Kraska. (2018). Northstar: An Interactive Data Science System. PVLDB 11(12): 2150-2164.

Carsten Binnig, Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Tim Kraska, Zeyuan Shang, Eli Upfal, Robert C. Zeleznik, and Emanuel Zgraggen. (2018). Towards Interactive Curation & Automatic Tuning of ML Pipelines. DEEM@SIGMOD 2018: 1:1-1:4.