Faculty Recruiting Support CICS

Compact Representations of Uncertainty in Clustering

18 Aug
Tuesday, 08/18/2020 3:00pm to 5:00pm
Zoom Meeting
PhD Thesis Defense
Speaker: Craig Greenberg

Zoom Meeting: https://umass-amherst.zoom.us/j/93934278492  

Meeting ID: 939 3427 8492


Flat clustering and hierarchical clustering are two fundamental tasks, often used to discover meaningful structures in data, such as subtypes of cancer, phylogenetic relationships, taxonomies of concepts, and cascades of particle decays in particle physics. When multiple clusterings of the data are possible, it is useful to represent uncertainty in clustering through various probabilistic quantities, such as the distribution over partitions or tree structures, and the marginal probabilities of subpartitions or subtrees.
Many compact representations exist for structured prediction problems, enabling the efficient computation of probability distributions, e.g., a trellis structure and corresponding Forward-Backward algorithm for Markov models that model sequences. However, no such representation has been proposed for either flat or hierarchical clustering models.  In this thesis, we present our work developing data structures and algorithms for computing probability distributions over flat and hierarchical clusterings, as well as for finding maximum a posteriori (MAP) flat and hierarchical clusterings, and various marginal probabilities, as given by a wide range of energy-based clustering models.

First, we describe a trellis structure that compactly represents distributions over flat or hierarchical clusterings. We also describe related data structures that represent approximate distributions. 
We then present algorithms that, using these structures, allow us to compute the partition function, MAP clustering, and the marginal probabilities of a cluster (and sub-hierarchy, in the case of hierarchical clustering) exactly.  We also show how these and related algorithms can be used to approximate these values, and analyze the time and space complexity of our proposed methods.
We demonstrate the utility of our approaches using various synthetic data of interest as well as in two real world applications, namely particle physics at the Large Hadron Collider at CERN and in cancer genomics.  We conclude with a brief discussion of future work.

Advisor: Andrew McCallum