
Social Measurement and Causal Inference with Text

Tuesday, 08/17/2021 10:00am to 12:00pm
Zoom Meeting
PhD Thesis Defense
Speaker: Katie Keith


The digital age has dramatically increased access to large-scale collections of digitized text documents. These corpora include, for example, digital traces from social media, decades of archived news reports, and transcripts of spoken interactions in political, legal, and economic spheres. For social scientists, this newly widespread availability of data has the potential to improve quantitative analysis of the relationships between language use and human thought, actions, and societal structure. However, the large-scale nature of these collections means that traditional manual approaches to analyzing content are extremely costly and do not scale. Furthermore, incorporating unstructured text data into quantitative analysis is difficult due to texts' high-dimensional nature and linguistic complexity.

This thesis blends (a) the computational strengths of natural language processing (NLP) and machine learning to automate and scale up quantitative text analysis with (b) two themes central to social scientific studies but often under-addressed in NLP: measurement (creating quantifiable summaries of empirical phenomena) and causal inference (estimating the effects of interventions and counterfactuals). First, we address measuring class prevalence in document collections; we contribute a generative probabilistic modeling approach to prevalence estimation and show empirically that our model is more robust to shifts in class priors between training and inference. Second, we examine cross-document entity-event measurement; we contribute an empirical pipeline and an EM-based distant supervision approach to identify the names of civilians killed by police from our corpus of web-scraped news reports. Third, we gather and categorize applications that use text to reduce confounding from causal estimates and contribute a list of open problems as well as guidance about data processing and evaluation decisions in this area. Finally, we contribute a new causal research design to estimate the natural indirect and direct effects of social group signals (e.g., race or gender) on conversational outcomes with separate aspects of language as causal mediators, motivated by a theoretical case study of the effect of an advocate's gender on interruptions from justices in U.S. Supreme Court oral arguments. We conclude by discussing the relationship between measurement and causal inference with text and future work at this intersection.

Advisor: Brendan O'Connor