PhD Dissertation Proposal Defense: Katherine Thai, "Modeling Literary Interpretation: Benchmarks and Methods for Machine Understanding of Literature"
Speaker
Katherine Thai
Abstract
As large language models (LLMs) grow increasingly capable, state-of-the-art model developers have turned their attention toward measuring whether models have met or even surpassed human abilities on tasks that require deep understanding and reasoning. Many existing evaluations focus on tasks such as solving math problems, answering factual questions, or following increasingly complex instructions. Literary analysis, long performed by human scholars, requires a different kind of reasoning: it is open-ended, context-rich, and steeped in cultural and stylistic nuance. This thesis argues that if LLMs are to engage meaningfully with the full spectrum of human language, they must also demonstrate proficiency in literary analysis.
This thesis explores how natural language processing (NLP) models can interpret literary texts, which demand the ability to process nuance, figurative language, and discourse-level meaning. Literary analysis poses distinct challenges for AI systems because of its reliance on both close textual reading and broad narrative understanding. To make this interpretive task computationally tractable, this thesis introduces novel datasets and evaluation frameworks that bridge literary scholarship and machine learning. First, it presents RELiC, a large-scale benchmark for literary evidence retrieval, in which models must recover quotations from literary works based on the surrounding critical analysis. Despite gains from dense retrieval methods, results reveal that contemporary models struggle with the interpretive depth the task requires. Next, the thesis introduces Par3, a dataset of paragraph-aligned literary translations, and shows that expert translators strongly prefer human over machine outputs, citing not just mistranslations but failures in narrative flow and stylistic coherence. A post-editing model trained on Par3 improves output quality, offering a bridge between literal translation and interpretive rewriting. Finally, the thesis revisits the RELiC task in the age of long-context language models. While models like GPT-4o approach human-level retrieval accuracy in a fraction of the time, smaller open-weight models fall short, highlighting the difficulty of literary reasoning at scale. These contributions advance the computational study of literature by centering interpretive depth as a core modeling challenge, laying the groundwork for future systems that can read not just fluently, but insightfully.
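To make the RELiC-style setup concrete, the sketch below shows one common form of dense evidence retrieval: embed the critical analysis surrounding a missing quotation, embed candidate quotations from the primary text, and rank the candidates by cosine similarity. This is a minimal illustration under stated assumptions, not the thesis's actual method: the encoder (sentence-transformers' all-MiniLM-L6-v2), the [QUOTE] placeholder convention, and the toy passages are all stand-ins chosen for demonstration.

# Minimal sketch of dense literary evidence retrieval (RELiC-style).
# Assumptions: a generic off-the-shelf sentence encoder stands in for a
# trained retriever; the context and candidate passages are toy examples.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical stand-in encoder

# Critical analysis surrounding a missing quotation (the query context).
context = (
    "The narrator's irony sharpens here, as the scene undercuts "
    "[QUOTE] the very romantic ideal it appears to celebrate."
)

# Candidate quotations drawn from the primary literary work (toy examples).
candidates = [
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife.",
    "She was convinced that she could have been happy with him.",
    "The evening altogether passed off pleasantly to the whole family.",
]

# Embed the query and candidates, then rank candidates by cosine similarity.
q = encoder.encode(context, normalize_embeddings=True)
C = encoder.encode(candidates, normalize_embeddings=True)
scores = C @ q  # dot product equals cosine similarity for unit-norm vectors

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {candidates[idx][:60]}...")

In this framing, the benchmark's difficulty lies in the query: the "relevant" passage is defined by an interpretive argument rather than by lexical or topical overlap, which is where the abstract notes that dense retrievers fall short.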
Advisor
Mohit Iyyer