
Low Resource Information Extraction for Complex Networks

25 Jun
Tuesday, 06/25/2024 11:00am to 1:00pm
PhD Dissertation Proposal Defense
Speaker: Erica Cai

Extracting structured instances of actions or relationships from unstructured text can help populate a knowledge graph, where nodes correspond to entities in the text (e.g., Portugal, United Kingdom, protein) and edges correspond to relationships or events between those entities (e.g., ally or payment, respectively). Given a knowledge graph that depicts relationships or interactions, researchers can perform various downstream analyses, e.g., identifying central nodes, which indicate important entities in intelligence report text, or analyzing transitivity in elite networks. Unfortunately, the research literature shows that information extraction methods struggle to perform well in low resource settings. Therefore, my thesis aims to (1) introduce information extraction methods that perform well in low resource settings and improve the evaluation of such methods, and (2) examine how the errors these methods introduce when populating a knowledge graph affect downstream analyses on that graph.
In the first part of the thesis, we focus on information extraction methods that extract tuple structures for populating graphs: entities from text (e.g., Sherlock Holmes, John Watson) map to nodes, and relationships between the entities (e.g., friend, enemy) map to edges. We contribute methods and evaluation improvements for (1) event extraction, where we (a) proposed an interpretable and efficient method for extracting event structures from text that outperforms the state of the art and (b) provided implementations of fixes for issues in evaluating such methods; and (2) named entity recognition and relation extraction, where we (a) proposed a few-shot method for extracting fine-grained named entities (e.g., religious institution, soldier, politician) from text that outperforms the state of the art and (b) proposed ways to overcome challenges in evaluating relation extraction methods that stem from how labels are assigned in datasets.
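As a rough illustration of how extracted tuples map onto a graph, the sketch below builds an undirected graph from (entity, relation, entity) tuples and computes degree centrality. It is a minimal sketch, not the method proposed in the thesis: the tuples are hand-written rather than extracted, "James Moriarty" and "Mrs. Hudson" are hypothetical additions to the example entities, and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical, hand-written tuples standing in for extractor output:
# (head entity, relation, tail entity).
triples = [
    ("Sherlock Holmes", "friend", "John Watson"),
    ("Sherlock Holmes", "enemy", "James Moriarty"),
    ("John Watson", "enemy", "James Moriarty"),
    ("Sherlock Holmes", "friend", "Mrs. Hudson"),
]


def build_graph(triples):
    """Map entities to nodes and relations to labeled, undirected edges."""
    adjacency = defaultdict(dict)
    for head, relation, tail in triples:
        adjacency[head][tail] = relation
        adjacency[tail][head] = relation
    return dict(adjacency)


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


graph = build_graph(triples)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Degree centrality is used here only because it is the simplest centrality measure; the abstract does not say which centrality measures the thesis studies.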
Given errors from information extraction methods, we next explore the impact of these errors on measurements of centrality, global clustering coefficient, and transitivity over learned graphs; these analyses help determine the importance of nodes and the tendency of nodes to cluster. We begin by investigating whether analysis over any synthetic network type could serve as a proxy for analysis over realistic networks. Next, we introduce errors into real-world and synthetic graphs, ranging from less realistic (random) to more realistic (node disaggregation/preferential attachment). We provide closed-form expressions for how errors affect measurements of global clustering coefficient and transitivity on simpler synthetic networks, and conduct simulations to observe how errors affect these measurements on real-world networks. Our findings inform social scientists and biomedicine researchers about how different types and magnitudes of errors from the information extraction step may affect downstream analyses over learned networks.
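The kind of simulation described above can be sketched as follows, under strong simplifying assumptions: the only error type is random edge deletion (the "less realistic" end of the range), the clean graph is a small complete graph, and transitivity is taken as closed connected triples divided by all connected triples. All function names and parameter values are hypothetical and are not drawn from the thesis.

```python
import random
from itertools import combinations


def transitivity(adj):
    """Closed connected triples / all connected triples (undirected, simple graph)."""
    closed = 0   # each triangle is counted three times, once per centre node
    triples = 0  # paths of length two, counted by their centre node
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for u, v in combinations(sorted(nbrs), 2):
            if v in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0


def drop_edges(adj, p, rng):
    """Copy the graph, removing each edge independently with probability p."""
    noisy = {node: set() for node in adj}
    for u in sorted(adj):
        for v in sorted(adj[u]):
            if u < v and rng.random() >= p:
                noisy[u].add(v)
                noisy[v].add(u)
    return noisy


def complete_graph(n):
    """Toy 'clean' graph in which every pair of nodes is connected."""
    return {u: {v for v in range(n) if v != u} for u in range(n)}


def mean_transitivity_under_noise(adj, p, trials=200, seed=0):
    """Average transitivity over repeated random edge deletions."""
    rng = random.Random(seed)
    values = [transitivity(drop_edges(adj, p, rng)) for _ in range(trials)]
    return sum(values) / len(values)


clean = complete_graph(12)
for p in (0.0, 0.1, 0.3):
    print(p, round(mean_transitivity_under_noise(clean, p), 3))
```

Under independent deletion with probability p, the expected number of triangles scales by (1 - p)^3 and the expected number of connected triples by (1 - p)^2, so the ratio of these expectations shrinks by a factor of (1 - p). This is a standard observation offered for intuition only, not the closed form derived in the thesis, and it does not cover the more realistic error types such as node disaggregation.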

Advisor: Brendan O'Connor

Join via Zoom