The University of Massachusetts Amherst
Manning College of Information & Computer Sciences

PhD Thesis Defense: Erica Cai, From Text to Networks: Enabling and Investigating Social Measurement via Low-Resource Knowledge Graph Extraction

Tuesday, May 13, 2025, 11:00 AM - 1:00 PM

Hybrid

Speaker

Erica Cai

Abstract

This thesis addresses the challenge of extracting structured instances of action or relationship occurrences from large volumes of unstructured text to populate knowledge graphs (KGs). In KGs, nodes represent entities mentioned in the text (e.g., Portugal, the United Kingdom, protein) and edges represent relationships or events (e.g., ally, payment). Extracted KGs allow researchers to perform various downstream analyses, such as identifying central nodes that indicate important entities in intelligence reports or examining comembership density in affiliation networks of elites. However, the research literature shows that information extraction methods often struggle to perform well in low-resource settings, which limits their effectiveness on massive text data in real-world settings. To overcome these limitations, this thesis introduces (1) new information extraction methods that perform effectively and efficiently in low-resource settings, together with improved methods for their evaluation, and (2) a study of how errors introduced during KG construction can affect the validity of downstream analyses over extracted graphs.
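To make the node and edge description above concrete, here is a minimal Python sketch. It is not the thesis's method: the tuples are hypothetical examples built from the entity and relation names in the abstract, and `build_graph` and `degree_centrality` are illustrative helper names. It assumes an undirected graph and uses normalized degree as one simple notion of a "central" node.

```python
from collections import defaultdict

# Hypothetical extracted tuples: (head entity, relation, tail entity).
triples = [
    ("Portugal", "ally", "United Kingdom"),
    ("United Kingdom", "ally", "France"),
    ("United Kingdom", "payment", "Portugal"),
]

def build_graph(triples):
    """Build an undirected adjacency map and keep relation labels per entity pair."""
    neighbors = defaultdict(set)
    labels = defaultdict(set)
    for head, relation, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
        labels[frozenset((head, tail))].add(relation)
    return neighbors, labels

def degree_centrality(neighbors):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

neighbors, labels = build_graph(triples)
scores = degree_centrality(neighbors)
# The node with the highest score is the most central under this measure.
central = max(scores, key=scores.get)
```

In this toy graph, `central` is "United Kingdom", since it is the only entity linked to both of the others.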

In the first part of the thesis, we focus on information extraction methods that extract tuple structures which populate graphs with entities (e.g., Sherlock Holmes, John Watson) and relationships between them (e.g., friend, enemy). We contribute methods and evaluation improvements for two key tasks: (1) Event extraction: We (a) propose an interpretable, efficient approach to extracting event structures from text that outperforms state-of-the-art methods; (b) introduce and apply a variation of this method over millions of news articles to investigate bias in global news coverage of critical disaster and terrorist attack events; and (c) recommend and implement fixes for issues in evaluating such methods. (2) Named entity recognition and relation extraction: We (a) develop a few-shot method for extracting fine-grained named entities (e.g., religious institution, soldier, politician) that achieves state-of-the-art performance and (b) propose solutions to challenges in evaluating relation extraction methods that stem from how labels are assigned in datasets.

The second part of the thesis investigates how extraction errors affect the quality and reliability of downstream knowledge graph analyses. These analyses often rely on metrics such as centrality, projection network density, and clustering coefficients, which are crucial for capturing node importance and how nodes tend to cluster. To overcome the shortage of realistic evaluation datasets, we introduce a novel resource: a collection of complete scanned books, some spanning up to 700 pages, each paired with large, meticulously labeled knowledge graphs. Using this benchmark, we assess KG extraction performance across a diverse set of models, conduct detailed analyses of realistic error sources, and examine patterns in how these errors affect downstream graph metrics that are relevant to practical applications. Our experiments identify a crucial performance threshold above which biases across most graph analysis metrics are very small, but also show that as KG extraction quality decreases, consistent patterns of over- and underestimation emerge across these metrics. Further simulation studies demonstrate that commonly used error models frequently fail to reproduce the bias patterns observed in real extraction settings, emphasizing the need for more sophisticated and heterogeneous models to accurately capture error propagation. Together, these findings equip practitioners with actionable insights and underscore the importance of advancing both information extraction methods and error modeling to support trustworthy, meaningful graph analysis.
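As a rough illustration of how extraction errors can bias graph metrics, the Python sketch below compares density and average clustering on a small gold graph against a copy with missed edges. Everything here is an assumption for illustration: the gold graph is a toy, the function names are hypothetical, and the error model (each edge dropped independently with the same probability) is the kind of simple uniform model that the abstract says often fails to match real extraction errors. It is not the thesis's benchmark or simulation setup.

```python
import random
from itertools import combinations

def density(nodes, edges):
    """Share of all possible undirected node pairs that are connected."""
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))

def average_clustering(nodes, edges):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(adj[v])
        if k < 2:
            continue
        # Count connected pairs among the neighbors of v.
        links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0

def drop_edges(edges, recall, rng):
    """Assumed uniform error model: keep each gold edge with probability `recall`."""
    return [e for e in edges if rng.random() < recall]

# Toy gold graph: four fully connected nodes plus one pendant node.
nodes = ["A", "B", "C", "D", "E"]
gold_edges = list(combinations(["A", "B", "C", "D"], 2)) + [("D", "E")]

rng = random.Random(0)  # fixed seed so the run is repeatable
noisy_edges = drop_edges(gold_edges, recall=0.7, rng=rng)
density_bias = density(nodes, noisy_edges) - density(nodes, gold_edges)
clustering_bias = (average_clustering(nodes, noisy_edges)
                   - average_clustering(nodes, gold_edges))
```

Because this error model only removes edges, density can never be overestimated. Reproducing the over- and underestimation patterns described above would also require spurious edges and non-uniform error rates.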

Advisor

Brendan O'Connor

Address

140 Governors Dr
Amherst, MA 01003
United States
