Learning with Aggregate Data

22 Aug
Wednesday, 08/22/2018 2:00pm to 4:00pm
Computer Science Building, Room 151
Ph.D. Thesis Defense
Speaker: Tao Sun

Aggregate data arises naturally in many research domains and in many forms. It obfuscates either the input data or the output supervision (i.e., the labels), replacing individual records with collective information that is difficult or impossible to disentangle. Nevertheless, aggregate data can be extremely valuable, even with these extra constraints. In this work, we explore how to exploit aggregate data, either as noisy aggregate input statistics or as aggregate output supervision, to build models for different applications.

First, we study Collective Graphical Models (CGMs), where only noisy aggregate observations are available. Applications abound in ecological studies and information-sensitive domains: ecologists use count statistics (e.g., the number of animals trapped, seen, heard, or otherwise detected) to estimate species abundance, distributions, and death rates, or to infer arrivals, departures, and population sizes of transient species. In clinical, census, and human mobility domains, data publishers anonymize data into aggregate form before release to protect privacy. We show how to build graphical models from such data. We prove that exact inference is NP-hard, even for trees, and propose several approximate inference algorithms. By solving the inference problem, we build a model of large-scale bird migration and a model of collective human mobility under differential privacy.
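To make the CGM setting concrete, here is a minimal toy sketch (all names and numbers are hypothetical, not from the thesis): individuals move among a few sites according to a Markov chain, but the observer only sees per-site counts, which preserve totals while discarding individual trajectories.

```python
import random

random.seed(0)

# Hypothetical toy setting: N individuals move among 3 sites following a
# Markov chain; only per-site counts (not trajectories) are observed.
N, T = 1000, 4
sites = [0, 1, 2]
P = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.8, 0.1], 2: [0.2, 0.3, 0.5]}

positions = [0] * N            # every individual starts at site 0
counts = []                    # aggregate observations, one vector per step
for t in range(T):
    counts.append([positions.count(s) for s in sites])
    positions = [random.choices(sites, weights=P[p])[0] for p in positions]

# Aggregation preserves totals but hides which individual went where.
assert all(sum(c) == N for c in counts)
print(counts[0])  # [1000, 0, 0]
```

CGM inference works in the opposite direction: given (possibly noise-corrupted) count vectors like `counts`, recover the posterior over the underlying sufficient statistics of the chain.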

Second, we study learning with aggregate supervision. The most common scenario is learning from bags of instances with only bag-level supervision. Applications span many domains: classifying point clouds in computer vision, estimating the redshift of galaxy clusters, estimating statistics of populations, and predicting the voting preferences of voting districts and demographic groups.

We separate this study into two parts. The first part learns an instance-level model: we want to develop individual voter models from publicly available precinct-level data. We propose a probabilistic model based on Learning from Label Proportions (LLP), and use cardinality potentials to perform exact inference in this model.
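A core computation behind cardinality potentials is the exact distribution over the number of positive labels in a bag (a Poisson-binomial distribution), which can be obtained by dynamic programming. The sketch below is an illustrative stand-in, not the thesis's implementation: the bag-level observation (e.g., a precinct's vote count) constrains the sum of the latent instance labels.

```python
def count_distribution(probs):
    """probs[i] = P(instance i is positive).

    Returns dist, where dist[k] = P(exactly k positives in the bag),
    computed by a dynamic program over instances (Poisson-binomial).
    """
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)      # instance labeled negative
            new[k + 1] += mass * p        # instance labeled positive
        dist = new
    return dist

dist = count_distribution([0.9, 0.5, 0.2])
assert abs(sum(dist) - 1.0) < 1e-12      # a valid probability distribution
```

Conditioning on an observed count then amounts to keeping the corresponding entry of `dist`, which is what makes exact bag-level inference tractable.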

The second part learns a bag-level model. We focus on distribution regression, a very general learning framework that encompasses many other learning problems (e.g., multi-instance learning and ecological inference) as special cases. We empirically evaluate both the "fixed-embedding" and "learned-embedding" strategies, identify the key elements of distribution regression, and show how a number of different methods can be applied to it. We evaluate on three tasks: estimating population statistics, point cloud classification, and predicting voting preferences in the 2012 and 2016 US presidential elections.
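The "fixed-embedding" strategy can be sketched in a few lines (a toy illustration under assumed synthetic data, not the thesis's pipeline): embed each bag with a fixed feature map, here just the bag mean, then regress bag labels on the embeddings. Real distribution regression would use richer embeddings, such as kernel mean embeddings, or learn the embedding jointly with the regressor.

```python
import random

random.seed(1)

def bag_embedding(bag):
    """Fixed embedding: the mean of phi(x) = x over the bag."""
    return sum(bag) / len(bag)

# Synthetic task: each bag is sampled around a latent mean mu; label = 2*mu + 1.
bags, labels = [], []
for _ in range(50):
    mu = random.uniform(0, 10)
    bags.append([random.gauss(mu, 1.0) for _ in range(30)])
    labels.append(2 * mu + 1)

# Ordinary least squares from scalar bag embeddings to bag labels.
x = [bag_embedding(b) for b in bags]
xbar, ybar = sum(x) / len(x), sum(labels) / len(labels)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, labels))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

# The regressor recovers the generating relationship y = 2*mu + 1.
assert abs(slope - 2) < 0.2 and abs(intercept - 1) < 1.0
```

The "learned-embedding" strategy replaces the fixed map with a parameterized one (e.g., a neural network pooled over the bag) trained end-to-end against the bag labels.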

Advisor: Daniel Sheldon