Using Poetry Data to Challenge Assumptions in Text Understanding

Tuesday, 03/12/2019 10:00am to 12:00pm
CS 150
Ph.D. Seminar
Speaker: John Foley

Abstract:

Modern advances in natural language processing (NLP) and information retrieval (IR) make it possible to automatically understand, categorize, process, and search textual resources. However, generalizing these approaches remains an open problem: models that appear to understand certain types of data must be retrained on other domains.

Models often make assumptions about the length, structure, discourse model, and vocabulary of a particular corpus. Trained models can become biased toward their original dataset, learning, for example, that all capitalized words are names of people or that short documents are more relevant than longer ones. As a result, small amounts of noise or shifts in style can cause models to fail on unseen data. The key to more robust models is to study text understanding tasks on more challenging and diverse data.

Poetry is an ancient art form that is believed to pre-date writing and is still a key form of expression through text today. Some poetry forms (e.g., haiku and sonnets) have a rigid structure but still break our traditional expectations of text. Other poetry forms drop punctuation and other rules in favor of expression.

We study text understanding and retrieval tasks with a focus on this adversarial domain to better understand artistic and emotional content. In addition, our contributions include a set of novel, challenging datasets that extend traditional tasks: a text classification task for which content features perform poorly, a named entity recognition task that is inherently ambiguous, and a retrieval corpus over the largest public collection of poetry ever created.

We begin by looking at poetry identification: the task of finding poetry within existing textual collections, such as the millions of digitally scanned books that are now publicly available. We then turn to modeling poetry: identifying entities is challenging because it requires NLP models that do not depend on typical syntactic structures. Finally, we return to IR and look at the challenges that retrieval models face on poetry data. For all of our tasks, we discuss how the lessons learned from our poetry-inspired models may apply to traditional tasks and datasets.

Advisor: James Allan