Speaker:

Ankita Gupta

Abstract:

A wide range of social contexts, such as public policy, law, and health, require consulting large collections of texts for high-stakes decision-making. These decisions can have a significant and lasting impact on individuals as well as society. For instance, policymakers must incorporate public feedback before passing important health regulations, and legal practitioners must research prior judicial opinions before making informed decisions and recommendations for their clients. However, it is impossible to manually read all the relevant documents, as they can span thousands of public comments and millions of judicial opinions. Existing automated content analysis techniques can help, but are often limited to categorical sentiment labels, failing to capture how social data richly intertwines arguments through personal narratives and other semantic relations.

In this thesis, I will present computational methods for transforming large-scale, unstructured text data into rich semantic structures to support human decision-making, with a focus on methods for understanding argumentative texts. Throughout these studies, I use graphs to represent and understand argumentative texts. I also explore the tradeoff between annotator effort and computational resources required for sophisticated argument analysis at scale.

First, I introduce the task of epistemic stance detection to identify whose beliefs are cited to build arguments, collect human annotations to train neural models, and apply them to study citation practices among U.S. political opinion elites.

Second, I demonstrate how LLMs can extract individual arguments from thousands of public comments on health policy without human supervision, and how these arguments can be aggregated into a corpus-level visual summary to provide insights for policymakers.

Third, while LLMs are effective, their high computational cost and the privacy concerns they raise remain significant barriers for practitioners in high-stakes applications. Thus, to enable development of efficient open-source models for argument analysis at scale, I turn to the legal domain, where highly standardized argumentative writing conventions make it possible to curate large-scale, high-quality supervision without manual effort. In particular, I present a data collection approach that semi-automatically mines a large dataset of naturally occurring, expert-annotated argument pairs with stances, and I explore the dataset's effectiveness for improving small open-source models compared to zero-shot approaches.

Finally, I show how the rich structure of legal argumentative writing can further serve as a supervision signal for building better retrieval models, enabling practitioners to navigate millions of judicial opinions to find relevant precedents. I conclude by outlining future directions for trustworthy and collaborative AI that can help humans reason through complex social decisions.

Advisor:

Brendan O'Connor