PhD Thesis Defense: Erica Cai, From Text to Networks: Enabling and Investigating Social Measurement via Low-Resource Knowledge Graph Extraction
Speaker
Abstract
This thesis addresses the challenge of extracting structured instances of actions or relationships from large volumes of unstructured text to populate knowledge graphs (KGs). In KGs, nodes represent entities mentioned in the text (e.g., Portugal, the United Kingdom, protein) and edges represent relationships or events (e.g., ally, payment). Extracted KGs allow researchers to perform various downstream analyses, such as identifying central nodes that indicate important entities in intelligence reports or examining comembership density in affiliation networks of elites. However, the research literature shows that information extraction methods often struggle in low-resource settings, which limits their effectiveness on massive text data in real-world settings. To overcome these limitations, this thesis introduces (1) new information extraction methods that perform effectively and efficiently in low-resource settings, together with improved methods for their evaluation, and (2) a study of how errors introduced during KG construction can affect the validity of downstream analyses over extracted graphs.
In the first part of the thesis, we focus on information extraction methods that extract tuple structures, which populate graphs with entities (e.g., Sherlock Holmes, John Watson) and relationships between them (e.g., friend, enemy). We contribute methods and evaluation improvements for two key tasks: (1) Event extraction: We (a) propose an interpretable, efficient approach to extracting event structures from text that outperforms state-of-the-art methods; (b) introduce a variation of this method and apply it to millions of news articles to investigate bias in global news coverage of critical disaster and terrorist attack events; and (c) provide recommendations for, and implementations of, fixes for issues in evaluating such methods. (2) Named entity recognition and relation extraction: We (a) develop a few-shot method for extracting fine-grained named entities (e.g., religious institution, soldier, politician) that achieves state-of-the-art performance and (b) propose solutions to challenges in evaluating relation extraction methods that arise from how labels are assigned in datasets.
The second part of the thesis investigates how extraction errors affect the quality and reliability of downstream knowledge graph analyses. These analyses often rely on metrics such as centrality, projection network density, and clustering coefficients, which capture node importance and how nodes tend to cluster. To overcome the shortage of realistic evaluation datasets, we introduce a novel resource: a collection of complete scanned books, some spanning up to 700 pages, each paired with a large, meticulously labeled knowledge graph. Using this benchmark, we assess KG extraction performance across a diverse set of models, conduct detailed analyses of realistic error sources, and examine patterns in how these errors affect downstream graph metrics that are relevant to practical applications. Our experiments identify a crucial performance threshold above which biases across most graph analysis metrics are very small, but also show that as KG extraction quality decreases, consistent patterns of over- and underestimation emerge across these metrics. Further simulation studies demonstrate that commonly used error models frequently fail to reproduce the bias patterns observed in real extraction settings, emphasizing the need for more sophisticated and heterogeneous models to accurately capture error propagation. Together, these findings equip practitioners with actionable insights and underscore the importance of advancing both information extraction methods and error modeling to support trustworthy, meaningful graph analysis.
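To make the idea of bias in downstream graph metrics concrete, the following is a minimal sketch and not the thesis's method or data. It assumes a tiny hypothetical "gold" graph and a uniform random edge-dropout error model, which is the simple, homogeneous kind of error model the abstract suggests often fails to match real extraction errors. The names `GOLD_NODES`, `GOLD_EDGES`, `drop_edges`, and `mean_bias` are illustrative assumptions.

```python
import random
from itertools import combinations

# Hypothetical gold graph: two triangles joined by one bridge edge.
GOLD_NODES = [0, 1, 2, 3, 4, 5]
GOLD_EDGES = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}


def density(nodes, edges):
    """Fraction of possible undirected edges that are present."""
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))


def average_clustering(nodes, edges):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(nbrs[v])
        if k < 2:
            continue
        # Count pairs of neighbors of v that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs[v], 2) if b in nbrs[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes)


def drop_edges(edges, miss_rate, rng):
    """Assumed error model: each gold edge is missed independently."""
    return {e for e in sorted(edges) if rng.random() >= miss_rate}


def mean_bias(metric, miss_rate, trials=2000, seed=0):
    """Average (noisy metric - gold metric) over simulated extractions."""
    rng = random.Random(seed)
    gold = metric(GOLD_NODES, GOLD_EDGES)
    noisy = [
        metric(GOLD_NODES, drop_edges(GOLD_EDGES, miss_rate, rng))
        for _ in range(trials)
    ]
    return sum(noisy) / trials - gold


if __name__ == "__main__":
    for rate in (0.1, 0.3, 0.5):
        print(
            f"miss_rate={rate}: "
            f"density bias={mean_bias(density, rate):+.3f}, "
            f"clustering bias={mean_bias(average_clustering, rate):+.3f}"
        )
```

Under this assumed model, the expected density shrinks in proportion to the miss rate, so the density bias is negative. Real extraction errors (for example, spurious edges or merged entities) need not behave this way, which is why a uniform model like this one should be read only as a baseline illustration.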
Advisor
Brendan O'Connor