PhD Dissertation Proposal: Tessa Masis, Mainstream Englishes Are Not the Only Fruit: Towards Reasonable Understanding of Multilingual User-Generated Text
Speaker:
Tessa Masis
Abstract:
As social media and, more recently, LLM-based conversational agents have become ubiquitous in daily life, understanding user speech and behavior on these platforms has become essential, both to assess the platforms' impact on society and to inform their improvement. However, most research in this area has focused on tools and analyses for Mainstream English (ME) data. As a result, our knowledge of how speakers of nonstandard Englishes and of languages other than English behave is often limited or absent, and our broader understanding of user behavior can be distorted. This thesis addresses the need for better understanding of non-ME user-generated data by (1) developing scalable NLP tools for such data, and (2) conducting analyses of real-world multilingual data in social media and user-LLM interactions.
I begin by presenting CGEdit, a human-in-the-loop method for generating contrast sets for morphosyntactic feature detection in low-resource Englishes, which are often found on social media. Feature detection is commonly used in studies of language variation to examine how identity and social context affect language use, and our method enables accurate study of variation within nonstandard Englishes. As a demonstration, I conduct the first national-level analysis of morphosyntactic variation in African American Language (AAL), applying CGEdit-trained feature detectors to a dataset of 227M geotagged tweets. The findings capture AAL morphosyntactic variation at unprecedented scale and detail, and point to specific underrepresented speech communities that need further study.
I then introduce the first multilingual analysis of the online #StopAsianHate movement, which is also the largest analysis of the movement to date, leveraging a combination of topic modeling, user modeling, and hand annotation. Here, I characterize significant differences between English and non-English topics in the hashtag and identify the considerable impact of global users (i.e., K-pop fans) on sustaining the online movement. Next, I introduce UserGeo, a geo-entity linking method for real-world multilingual social media data. The location of social media users is valuable for many computational social science tasks, as the #StopAsianHate and AAL analyses demonstrate, but the lack of geo-entity linking tools for non-English social media data can greatly limit which users are studied. UserGeo represents each location as the average of embeddings of labeled user-input location names and, unlike previous methods, enables selective prediction via an interpretable confidence score. Both effective and efficient, it achieves state-of-the-art performance on a global dataset.
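The idea of averaged location embeddings with selective prediction can be illustrated with a minimal sketch. This is not the UserGeo implementation: the hashed character-trigram embedding, the cosine-similarity confidence score, the 0.5 threshold, and all function names below are assumptions chosen only to keep the example self-contained.

```python
import math
import zlib
from collections import defaultdict

DIM = 1024  # assumed embedding size for this toy example


def normalise(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def embed(name, n=3):
    """Toy stand-in for a learned embedding: hashed character n-gram counts."""
    text = f" {name.lower().strip()} "
    vec = [0.0] * DIM
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    return normalise(vec)


def build_centroids(labeled_names):
    """Average the embeddings of all user-input names that share a label."""
    sums = defaultdict(lambda: [0.0] * DIM)
    counts = defaultdict(int)
    for name, label in labeled_names:
        for i, x in enumerate(embed(name)):
            sums[label][i] += x
        counts[label] += 1
    return {
        label: normalise([x / counts[label] for x in total])
        for label, total in sums.items()
    }


def predict(name, centroids, threshold=0.5):
    """Return (label, score); label is None when the score is below threshold."""
    query = embed(name)
    best_label, best_score = None, 0.0
    for label, centroid in centroids.items():
        score = sum(q * c for q, c in zip(query, centroid))  # cosine similarity
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score  # abstain: selective prediction
    return best_label, best_score


# Hypothetical labeled user-input location strings.
labeled = [
    ("NYC", "New York"),
    ("new york city", "New York"),
    ("New York, NY", "New York"),
    ("Seoul", "Seoul"),
    ("서울", "Seoul"),
    ("Seoul, Korea", "Seoul"),
]
centroids = build_centroids(labeled)
print(predict("new york", centroids))
print(predict("somewhere over the rainbow", centroids))
```

Because the score in this sketch is a cosine similarity between non-negative vectors, it falls between 0 and 1 and can be read directly as how closely an input resembles the names already seen for a location. Raising the threshold trades coverage for precision.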
In proposed work, I will present the first analysis of privacy disclosures in non-English user interactions with LLM-based conversational agents (e.g., those in the WildChat dataset). This analysis will reveal patterns in the type and frequency of privacy disclosures across languages, and may confirm previous work suggesting cross-cultural differences in privacy concerns. Lastly, I will adapt these privacy disclosure annotations into training data for a multilingual prompt sanitization tool that helps non-English speakers easily rewrite their prompts and reduce overly personal disclosures.
Advisor:
Brendan O'Connor