Modern organizations face a massive number of heterogeneous data sets. It’s not uncommon for a large enterprise to have 10,000 or more structured databases, not to mention millions of spreadsheets, text documents, and emails. Typically these databases do not share a common schema. As a result, data scientists in large organizations spend 90% or more of their time finding the data they need and then cleaning and integrating it.
In conjunction with the Qatar Computing Research Institute (QCRI), we have built Data Civilizer, a workflow system with a collection of discovery, cleaning, and integration tools that help users with their data integration problems. We have started working with two groups at Massachusetts General Hospital (MGH) that face serious integration challenges. To accommodate their needs, we have begun work on Data Civilizer 2.0, which extends our system with user-defined modules, including machine learning modules, along with a sophisticated visualization tool and a visual data debugger.
In addition, we have found that user interest often focuses on only a fraction of the data (the hot data). In this case, cleaning all the data is a waste of resources, or even impractical when resources are limited. To overcome this issue, we are studying new techniques for cleaning only the portion of the data that is actually useful for the analysis at hand.
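To make the idea of cleaning only the hot data concrete, here is a minimal sketch, not Data Civilizer's actual technique. It assumes a hypothetical cleaning rule (`clean_row`) and a hypothetical wrapper (`LazyCleaner`) that cleans a record only when an analysis first touches it, then caches the result so that untouched (cold) records are never cleaned.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def clean_row(row: Row) -> Row:
    """Hypothetical cleaning rule: trim whitespace and normalize case of 'city'."""
    cleaned = dict(row)  # copy so the raw record is left untouched
    cleaned["city"] = cleaned["city"].strip().title()
    return cleaned


class LazyCleaner:
    """Hypothetical wrapper that cleans rows on first access and caches them."""

    def __init__(self, rows: List[Row], cleaner: Callable[[Row], Row]) -> None:
        self._rows = rows
        self._cleaner = cleaner
        self._cache: Dict[int, Row] = {}

    def get(self, index: int) -> Row:
        # Clean only if this row has not been requested before.
        if index not in self._cache:
            self._cache[index] = self._cleaner(self._rows[index])
        return self._cache[index]

    def query(self, indexes: List[int]) -> List[Row]:
        # Only the rows an analysis actually touches are cleaned.
        return [self.get(i) for i in indexes]

    @property
    def cleaned_count(self) -> int:
        return len(self._cache)


# Illustrative data: three dirty records, of which the analysis needs two.
rows = [
    {"id": "1", "city": "  boston "},
    {"id": "2", "city": "DOHA"},
    {"id": "3", "city": " cambridge"},
]
lazy = LazyCleaner(rows, clean_row)
hot = lazy.query([0, 2])
```

In this sketch the second record is never cleaned because no query asks for it, so cleaning cost scales with the data the analysis uses rather than with the size of the full data set.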
Raul Castro-Fernandez, Sam Madden, Elkindi Rezig, Giovanni Simonini, Michael Stonebraker