Speaker:

Marisa Hudspeth

Abstract:

Most approaches in natural language processing are developed and validated primarily on English and other high-resource languages. As a result, their data pipelines, modeling choices, and evaluations often encode assumptions that do not hold in low-resource settings. In such settings, where performance cannot be improved by simply scaling up the training data or model size, it is often necessary to counteract these assumptions with language-aware methods.

This thesis examines such language-specific considerations through case studies on Latin. By improving data quality, evaluation practices, and modeling strategies, we advance Latin NLP applications in digital humanities, historical text analysis, and computational philology. At the same time, despite Latin being low-resource in terms of pretraining data, its long history of academic study has produced an abundance of expert-annotated linguistic resources. This unique mix makes Latin a useful testbed for developing general, linguistically informed methods for low-resource language modeling.

The first part of this thesis focuses on special considerations for the training and evaluation of Latin language models. First, we conduct a review of existing Latin treebanks, which are datasets of sentences annotated with each word's part of speech, morphological features, and dependency relations. We harmonize inconsistent annotations and introduce more realistic data splits that better evaluate models' cross-time generalizability. Next, we move beyond syntax-level evaluations by introducing RespondeoQA, a bilingual Latin-English question answering benchmark designed to assess the cultural and linguistic knowledge of generative models. Finally, we investigate linguistically informed modeling strategies for low-data settings. Our contextual, morphologically guided tokenization method incorporates expert-curated linguistic resources into model training and consistently improves performance on four downstream tasks. Overall, these studies demonstrate the importance of linguistic, cultural, and historical awareness when training and evaluating low-resource language models.

The final chapter will apply these insights beyond Latin by exploring continual pretraining as a method to induce cross-lingual transfer to low-resource languages. Prior work in this area typically relies on English-dominant or massively multilingual base models as the starting point for continual pretraining. Although this setup is computationally more efficient, it overlooks other source languages that may be more beneficial than English and limits our ability to quantify the contribution of any particular source language. We propose a more targeted approach in which monolingual base models are trained on several high-resource languages and then continually pretrained on a diverse set of low-resource target languages, holding the size of the source and target data constant. By comparing downstream performance across source-target language pairs, this work will evaluate the role of language similarity, script, linguistic features, and other factors in inducing positive transfer to low-resource languages.

Advisor:

Brendan O'Connor