PhD Thesis Defense: Nelson Evbarunegbe, Machine Learning Techniques for Molecular Property Prediction and Applications to Mycomembrane Permeation
Speaker:
Nelson Evbarunegbe
Abstract:
The ability to accurately predict molecular properties aids modern drug discovery, offering a computational route to identify compounds with desired biological and physicochemical properties before costly laboratory experiments. Among the most critical yet underexplored properties is the ability of small molecules to permeate the complex mycomembrane of Mycobacterium tuberculosis (Mtb) — the causative agent of tuberculosis. Tuberculosis remains one of the world’s deadliest infectious diseases, and the scarcity of new anti-tubercular agents is largely attributed to the slow growth of the pathogen and the difficulty in identifying molecules that can successfully penetrate its unique cell envelope.
This dissertation explores machine learning techniques for molecular property prediction, with a focus on modeling and understanding mycomembrane permeability. First, we introduce MycoPermeNet, a graph-based deep learning model designed to learn the relationship between molecular structure and permeability across the Mtb mycomembrane. The model is trained on a first-of-its-kind dataset of permeability measurements for 1,558 small molecules. It achieves robust predictive performance and also provides interpretable chemical insights, identifying key scaffolds and molecular fragments that promote mycomembrane permeability.
In the second part of this dissertation, recognizing potential limitations in generalizing to out-of-distribution and chemically novel compounds, we extend our approach with MycoPermeNet v2, which integrates multi-level feature representations combining global and local molecular information. This enhanced architecture significantly improves prediction accuracy and generalization. We validate the generalizability of our approach by demonstrating superior performance across multiple benchmark datasets from the MoleculeNet library.
In the third part of this dissertation, we address a challenge that active learning faces in molecular property prediction in low-data settings such as the mycomembrane permeability task. Active learning is only as effective as the information it can extract from the molecular representations it is given, which can limit overall performance. Motivated by this limitation, we develop ActiveFusion, an active learning framework for molecular property prediction that systematically evaluates different molecular representations, and their fusion, within iterative learning workflows. We show that fusing graph-based embeddings with physicochemical descriptors improves predictive performance and discovery within active learning pipelines compared with either representation used on its own, and that these gains generalize across multiple molecular property benchmarks.
Collectively, this dissertation uses chemical and biological domain knowledge to improve our ability to model, interpret, and leverage molecular properties, with mycobacterial permeability as a case study. The methodologies developed herein (graph-based modeling, feature fusion, and active learning) advance both the predictive power and the interpretability of machine learning-driven drug discovery, contributing a valuable foundation for rational antibiotic compound design.
Advisor:
Anna Green