PhD Dissertation Proposal: Miguel Fuentes, Synthetic Data with Applications to Privacy and Ecology
Speaker:
Miguel Fuentes
Abstract:
Reconstructing complex, global probability distributions from restricted data is a significant challenge in settings where access to individual-level data is limited. These restrictions can be policy-based, to protect sensitive human data, or physical, due to the logistical difficulty of data collection, as with animal tracking studies. This dissertation demonstrates that problems in two distinct fields, differential privacy and computational ecology, can be addressed through a unified methodological framework. In both domains, individual-level data is unavailable, but aggregate population-level statistics, specifically marginal distributions, are accessible. The central task developed in this work is to learn a global probabilistic model, often a graphical model, that is consistent with these observed marginals.
In the domain of differential privacy, this framework is applied to two fundamental, related tasks: privately answering marginal queries and generating synthetic data. To improve the scalability of adaptive mechanisms for marginal query answering, we introduce algorithmic advances that integrate residual-based decomposition into iterative frameworks. These advances accelerate the reconstruction of query answers and optimize privacy budget allocation, though they do not themselves produce a synthetic dataset. For the task of synthetic data generation, we develop a joint selection framework that adaptively incorporates public data by choosing between costly private measurements and potentially biased public ones.
These same marginal-based modeling principles are then applied to computational ecology. We introduce BirdFlow, a novel probabilistic framework that infers individual animal movement from aggregate citizen science data. By treating weekly species abundance estimates as marginals of a dynamic process, the framework learns a probabilistic movement model consistent with these aggregate observations. This learned model serves a dual purpose: it can be used to perform inference and make probabilistic forecasts of migratory patterns, and it can generate synthetic full-annual-cycle migratory trajectories, enabling analyses where individual tracking data is sparse.
Collectively, this research demonstrates that a unified focus on marginal-based modeling provides a robust and scalable paradigm. This approach yields solutions to fundamental challenges in settings where real data is sparse or sensitive, enabling both accurate, privacy-preserving query answering and the generation of high-fidelity synthetic data.
Advisor:
Dan Sheldon