The University of Massachusetts Amherst
Manning College of Information & Computer Sciences

PhD Dissertation Proposal: Tessa Masis, Mainstream Englishes Are Not the Only Fruit: Towards Reasonable Understanding of Multilingual User-Generated Text

Thursday, April 9, 2026, 1:30 PM - 3:30 PM

Online
PhD Dissertation Proposal Defense

Speaker:

Tessa Masis

Abstract:

As social media and, more recently, LLM-based conversational agents have become ubiquitous and integrated into daily life, understanding user speech and behavior on these platforms has become essential, both for assessing the platforms' impact on society and for informing improvements to them. However, most research in this area has focused on tools and analysis for Mainstream English (ME) data. Because of this narrow focus, knowledge of the behavior of users who write in nonstandard English or in languages other than English is often limited or absent, and our understanding of user behavior more broadly can be distorted. This thesis addresses the need for better understanding of non-ME user-generated data by (1) developing scalable NLP tools for such data and (2) conducting analyses of real-world multilingual data in social media and user-LLM interactions.

I begin by presenting CGEdit, a human-in-the-loop method to generate contrast sets for morphosyntactic feature detection in low-resource Englishes, which are often found on social media. Feature detection is commonly used in studies of language variation to examine how identity and social contexts affect language use, and this method enables accurate study of variation within nonstandard Englishes. To demonstrate this, I conduct the first national-level analysis of morphosyntactic variation in African American Language (AAL), applying CGEdit-trained feature detectors to a dataset of 227M geotagged tweets. The findings capture AAL morphosyntactic variation at unprecedented scale and detail, and point to specific underrepresented speech communities needing further study.

I then present the first multilingual, and the largest, analysis of the online #StopAsianHate movement, combining topic modeling, user modeling, and hand annotation. Here, I characterize significant differences between English and non-English topics in the hashtag and identify the considerable impact of global users (i.e., K-pop fans) on sustaining the online movement. Next, I introduce UserGeo, a geo-entity linking method for real-world multilingual social media data. The location of social media users is valuable for many computational social science tasks, as demonstrated in the #StopAsianHate and AAL analyses, but the lack of geo-entity linking tools for non-English social media data can greatly limit which users are studied. UserGeo represents locations as averaged embeddings from labeled user-input location names and, unlike previous methods, enables selective prediction via an interpretable confidence score. Both effective and efficient, it achieves state-of-the-art performance on a global dataset.
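
The idea of averaged location embeddings with selective prediction can be illustrated with a minimal sketch. This is not the UserGeo implementation: the abstract does not specify the encoder, the similarity measure, or how the confidence score is computed. The sketch assumes a toy character-trigram encoder, cosine similarity, and a confidence equal to the best similarity; the names `embed`, `build_prototypes`, `predict`, and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def embed(name):
    """Toy stand-in for a text encoder (assumption): a sparse, L2-normalized
    vector of character-trigram counts for a user-entered location string."""
    text = f"  {name.lower().strip()}  "
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def build_prototypes(labeled_names):
    """Represent each location by the normalized average of the embeddings
    of all labeled location names that refer to it."""
    groups = defaultdict(list)
    for name, location in labeled_names:
        groups[location].append(embed(name))
    prototypes = {}
    for location, vectors in groups.items():
        total = defaultdict(float)
        for vec in vectors:
            for gram, weight in vec.items():
                total[gram] += weight
        # Normalizing the sum gives the same direction as normalizing the mean.
        norm = math.sqrt(sum(w * w for w in total.values()))
        prototypes[location] = {g: w / norm for g, w in total.items()}
    return prototypes


def predict(name, prototypes, threshold=0.5):
    """Return (location or None, confidence). The confidence is the cosine
    similarity to the closest prototype (an assumed scoring rule); the
    function abstains by returning None when it falls below the threshold."""
    query = embed(name)
    scores = {
        location: sum(w * proto.get(gram, 0.0) for gram, w in query.items())
        for location, proto in prototypes.items()
    }
    best = max(scores, key=scores.get)
    confidence = scores[best]
    return (best if confidence >= threshold else None), confidence


if __name__ == "__main__":
    # Illustrative labeled names only; not data from the thesis.
    protos = build_prototypes([
        ("NYC", "New York"),
        ("new york city", "New York"),
        ("New York, NY", "New York"),
        ("Tokyo", "Tokyo"),
        ("東京", "Tokyo"),
        ("tokyo japan", "Tokyo"),
    ])
    print(predict("New York", protos))
    print(predict("zzzqqq", protos))
```

Because the score is a plain cosine similarity in [0, 1], the threshold trades coverage for accuracy in a way that is easy to inspect: raising it makes the sketch abstain on more inputs.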

In proposed work, I will present the first analysis of privacy disclosures in non-English user interactions with LLM-based conversational agents, drawing on data such as the WildChat dataset. This analysis will reveal patterns in privacy disclosure type and frequency across languages, and may confirm previous work suggesting cross-cultural differences in privacy concerns. Lastly, I will adapt these privacy disclosure annotations to create training data for a multilingual prompt sanitization tool that helps non-English speakers rewrite their prompts and reduce overly personal disclosures.

Advisor:

Brendan O'Connor

Online event posted in PhD Dissertation Proposal Defense for Faculty and Current students

Join via Zoom
