Social Measurement and Causal Inference with Text

Tuesday, 08/17/2021 10:00am to 12:00pm
Zoom Meeting
PhD Thesis Defense
Speaker: Katie Keith


The digital age has dramatically increased access to large-scale collections of digitized text documents. These corpora include, for example, digital traces from social media accounts, decades of archived news reports, and transcripts of spoken interactions in political, legal, and economic spheres. For social scientists, this new widespread data availability has potential for improved quantitative analysis of relationships between language use and human thought, actions, and societal structure. However, the large-scale nature of these collections means that traditional manual approaches to analyzing content are extremely costly and do not scale. Furthermore, incorporating unstructured text data into quantitative analysis is difficult due to the high-dimensional nature of text and its linguistic complexity.

This thesis blends (a) the computational strengths of natural language processing and machine learning to automate and scale up text analysis with (b) two themes central to social scientific studies: measurement (creating quantifiable summaries of empirical phenomena) and causal inference (estimating the effects of interventions and counterfactuals). First, we address measuring class prevalence in document collections; we contribute a generative probabilistic modeling approach to prevalence estimation and show empirically that our model is more robust to shifts in class priors between training and inference. Second, we examine cross-document entity-event measurement; we contribute an empirical pipeline and an EM-based distant supervision approach to identify the names of civilians killed by police from our corpus of web-scraped news reports. Third, we gather and categorize applications that use text to reduce confounding in causal estimates, and we contribute a list of open problems as well as guidance about data processing and evaluation decisions in this area. Finally, we contribute a conceptual framework for language as a causal mediator in identity-based bias analysis and apply this framework to the example of gender bias in Supreme Court oral argument interruptions. We conclude by discussing the interconnectedness of measurement and causal inference with text and future work on corpus-level empirical evaluation at this intersection.

Advisor: Brendan O'Connor