Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty

Friday, 04/04/2014 5:00am to 7:00am
Ph.D. Dissertation Proposal Defense

Liping Peng

Computer Science Building, Room 142

Data management is becoming increasingly important in many applications, particularly in scientific databases, where data is naturally modeled by continuous random variables and queries can involve complex predicates that are difficult for users to express explicitly. This is especially true for extremely large scientific databases such as the Large Synoptic Survey Telescope (LSST) and the Sloan Digital Sky Survey (SDSS). My thesis work aims to provide efficient support for both "data uncertainty" and "query uncertainty".

When data is uncertain, an important class of queries uses complex selection and join predicates and requires query answers to be returned if their existence probabilities pass a threshold. We start by optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting joins using new indexes on uncertain data, (ii) expediting selections by reducing the dimensionality of integration and using faster filters, and (iii) optimizing a query plan using a dynamic, per-tuple approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of our techniques, as well as significant performance gains over a state-of-the-art threshold query optimizer.
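
To make the notion of a threshold query concrete, the following is a minimal sketch and not the method proposed in the thesis. It assumes each tuple carries a single attribute modeled as a Gaussian random variable, and that the selection predicate is a simple range; the function names, the tuple layout, and the sample values are all hypothetical. It shows only the semantics (return a tuple if its existence probability passes the threshold), not the indexes, filters, or per-tuple plan optimization described above.

```python
from statistics import NormalDist

def existence_probability(mean, stddev, low, high):
    """Probability that a Gaussian attribute falls in [low, high].

    Assumption: the uncertain attribute is Gaussian; the thesis covers
    general continuous random variables and more complex predicates.
    """
    dist = NormalDist(mu=mean, sigma=stddev)
    return dist.cdf(high) - dist.cdf(low)

def threshold_select(tuples, low, high, threshold):
    """Return (tuple_id, probability) for tuples whose existence
    probability under the range predicate is at least `threshold`."""
    results = []
    for tuple_id, mean, stddev in tuples:
        p = existence_probability(mean, stddev, low, high)
        if p >= threshold:
            results.append((tuple_id, p))
    return results

# Hypothetical uncertain tuples: (id, mean, standard deviation).
readings = [("a", 10.0, 1.0), ("b", 14.0, 2.0), ("c", 30.0, 0.5)]
selected = threshold_select(readings, low=8.0, high=12.0, threshold=0.5)
```

With these sample values, only tuple "a" has most of its probability mass inside the range, so it is the only one returned. For multi-attribute predicates the probability becomes a multi-dimensional integral, which is where reducing the dimensionality of integration would matter.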

Next, we address uncertain data management in the array model, which has recently gained popularity for scientific data processing due to its performance benefits. Array databases may involve both "value uncertainty" within individual tuples and "position uncertainty" regarding where a tuple belongs in an array given uncertain dimension attributes. In our work, we define the formal semantics of array operations on uncertain data involving both types of uncertainty. To address the new challenge raised by position uncertainty, we propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that our best-performing techniques often outperform baselines by a wide margin while incurring only a small storage overhead.

Finally, as the data volumes of scientific applications and the user community continue to grow, users may not be able to express their data interests precisely. This leads to a strong need for "interactive data exploration", which can navigate users through a subspace of a large data set. We propose an automated query steering framework that automatically learns user interests and infers "classification" queries that retrieve data relevant to those interests. It does so by iteratively and incrementally incorporating user feedback on strategically collected data samples into classification models such as support vector machines.

Advisor: Yanlei Diao