
Incremental Non-Greedy Clustering at Scale

Friday, 01/08/2021 12:00pm to 2:00pm
Zoom Meeting
PhD Dissertation Proposal Defense
Speaker: Nicholas Monath

Zoom Meeting: https://umass-amherst.zoom.us/j/94854311403

Meeting ID: 948 5431 1403

Abstract:

Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution place several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions and deletions of data, and (3) support for any user-specified similarity function.

Hierarchical clustering is often desired because it represents multiple alternative clusterings (e.g., at different granularities), providing both fine-grained clusters and a way to express uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all of these properties. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions; moreover, these methods often impose restrictions on the similarity functions used and/or tend to over-merge clusters, resulting in lower accuracy than desired.

In this thesis, we present incremental and scalable clustering methods that empirically satisfy the above criteria. Such a method must represent uncertainty and meaningful alternative clusterings, efficiently and effectively reconsider past decisions in the incremental case, and exploit parallelism to scale to massive datasets in the batch setting. Our method Grinch handles incrementally arriving data in a non-greedy fashion, applying tree-structure re-arrangements, driven by any specified similarity function, to reconsider past decisions. To use parallel processing for scalability to massive datasets, our method SCC operates in a round-based, bottom-up manner, making certain merge decisions independently in parallel within each round for efficiency. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which better represent uncertainty by allowing overlapping (not necessarily nested) clusters, and introduce Llama, a parallelizable bottom-up method for DAG-structured clustering. In our proposed work, we extend SCC and Llama with incremental addition/deletion algorithms inspired by the re-arrangement operations of Grinch. For each proposed method, we provide both theoretical and empirical analyses. Empirically, our methods achieve state-of-the-art results on the largest clustering benchmarks in both the batch and incremental settings (including adversarial data orderings).
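To make the re-arrangement idea concrete, here is a minimal Python sketch of an insert-then-graft step in the spirit of Grinch: each arriving point is first attached beside its most similar leaf (the greedy step), and a graft then moves it next to a better-matching subtree elsewhere in the tree, undoing the greedy choice when warranted. The average-linkage similarity, the single graft pass per insertion, and all names here are illustrative assumptions for this sketch, not the thesis algorithms, which include further operations (e.g., rotations and restructuring) and are engineered to scale far beyond this toy.

import numpy as np

class Node:
    """A cluster-tree node; leaves hold a single data point."""
    def __init__(self, point=None, children=()):
        self.point = point                 # set on leaves only
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(node):
    if node.point is not None:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]

def all_nodes(node):
    yield node
    for c in node.children:
        yield from all_nodes(c)

def ancestors(node):
    while node.parent is not None:
        node = node.parent
        yield node

def sim(a, b):
    """Average-linkage similarity: negative mean pairwise distance."""
    pa = np.array([leaf.point for leaf in leaves(a)])
    pb = np.array([leaf.point for leaf in leaves(b)])
    return -float(np.mean(np.linalg.norm(pa[:, None] - pb[None, :], axis=-1)))

def make_sibling(node, new):
    """Splice `new` in as a sibling of `node` under a fresh parent."""
    old_parent = node.parent
    merged = Node(children=[node, new])
    if old_parent is not None:
        old_parent.children[old_parent.children.index(node)] = merged
        merged.parent = old_parent
    return merged

def graft(root, new):
    """Non-greedy repair: move `new` beside a better-matching subtree."""
    parent = new.parent
    sibling = next(c for c in parent.children if c is not new)
    exclude = {new, parent, sibling} | set(ancestors(new))
    candidates = [v for v in all_nodes(root) if v not in exclude]
    if not candidates:
        return root
    best = max(candidates, key=lambda v: sim(new, v))
    if sim(new, best) <= sim(new, sibling):
        return root                        # current placement is already best
    grand = parent.parent                  # detach `new`: collapse its parent
    if grand is None:
        root, sibling.parent = sibling, None
    else:
        grand.children[grand.children.index(parent)] = sibling
        sibling.parent = grand
    make_sibling(best, new)                # re-attach beside the better match
    return root

def insert(root, x):
    """Greedy nearest-leaf attachment followed by one graft check."""
    new = Node(point=np.asarray(x, dtype=float))
    nearest = max(leaves(root), key=lambda leaf: sim(leaf, new))
    merged = make_sibling(nearest, new)
    if merged.parent is None:              # `nearest` was the old root
        root = merged
    return graft(root, new)

# Points arrive one at a time; the tree is updated incrementally.
rng = np.random.default_rng(0)
stream = rng.normal(size=(20, 2))
root = Node(point=stream[0])
for x in stream[1:]:
    root = insert(root, x)
print(f"incrementally built a tree over {len(leaves(root))} points")

Note that each insertion in this sketch rescans the entire tree, so it is far from the scalability the thesis targets; the actual methods rely on more efficient search and, in the batch setting, parallel round-based processing.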

Advisor: Andrew McCallum