Improving and Understanding Data Quality in Large-scale Data Systems

Friday, 12/08/2017 11:00am to 1:00pm
Lederle Graduate Research Center, Room A310
Ph.D. Dissertation Proposal Defense
Speaker: Xiaolan Wang

"Improving and Understanding Data Quality in Large-scale Data Systems"

Systems and applications rely heavily on data, which makes data quality critical to their correct operation. Low-quality data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. Improving data quality involves far more than purging datasets of errors; it is more important to improve the processes that produce the data, to collect high-quality data sources from which the data is generated, and to truly understand the quality of the data. The objective of this thesis is therefore to improve and understand data quality from these three aspects.

First, we develop two efficient and effective tools, DataXRay and QFix, that diagnose systematic errors in general data extraction systems and relational data systems, respectively. Second, we design a recommendation system, MIDAS, that identifies high-quality data sources for augmenting knowledge bases. Third, as proposed future work, we aim to develop an explanation tool for understanding the differences in data generated by similar processes and sources.

Advisor: Alexandra Meliou