
Robust and Fair Algorithms for Clustering with Applications to Data Integration

Friday, 05/01/2020 2:00pm to 4:00pm
Zoom Meeting: https://umass-amherst.zoom.us/ Meeting ID: 972 5409 4585 Password: sgumass
PhD Dissertation Proposal Defense

Abstract:

A growing number of data-driven applications are used to make decisions that have far-reaching consequences and significant societal impact. Entity resolution, community detection, and taxonomy construction are some of the building blocks of these applications, and clustering is the fundamental concept underlying all of them. Therefore, the importance of accurate, robust, and fair clustering methods cannot be overstated.

We tackle the various facets of clustering with a multi-pronged approach described below.

(i) While identifying clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness of clustering methods that leverage human supervision through an oracle, i.e., an abstraction of crowdsourcing. Additionally, we focus on scalability to handle web-scale datasets.

(ii) In community detection applications, a common obstacle to evaluating the quality of clustering techniques is the lack of ground-truth data. We propose a generative model to capture interactions between records that belong to different clusters and devise techniques for efficient cluster recovery.

(iii) Bias in data can arise from discriminatory treatment of marginalized groups, from sampling methods, or even from measurement errors. We study the impact of this bias on the generated clusters and develop techniques that guarantee fair representation of different groups.

We prove the noise tolerance of our algorithms and support the theory by demonstrating their efficacy and efficiency on various real-world datasets for these applications.

Advisor: Barna Saha