Unsupervised Discourse Modeling

Monday, 03/08/2010 11:00am to 12:00pm
Seminar

Aria Haghighi
University of California, Berkeley
Computer Science

Computer Science Building, Rooms 150 & 151

Faculty Host: Andrew McCallum

Most natural language processing (NLP) work examines language through a microscope, analyzing structure at or below the level of individual sentences. However, most of the information that we care about exists more globally, linking multiple sentences and even combining multiple sources (for example, articles, conversations, blogs, and tweets). Understanding this global information requires identifying people, objects, and events as they evolve over a discourse. In this talk, I discuss unsupervised learning approaches to analyzing discourse information structure.

The initial step in understanding discourse structure is to recognize the entities (people, artifacts, locations, and organizations) being discussed and track references to them throughout. Entities are referred to in many ways: with proper names ("Barack Obama"), nominal descriptions ("the President"), and pronouns ("he" or "him"). Reference resolution is the task of deciding which entity a textual mention refers to. I present a unified statistical model for reference resolution that can be learned in an unsupervised way (without labeled data) and that incorporates rich semantic features. This model yields the best reference resolution results when compared with other systems, supervised or unsupervised.

Once we understand the entities being referenced in a discourse, we must understand the events that they participate in, as well as the significance of these events to the main narrative. This aspect of discourse is central to tasks such as multi-document summarization, where a system must output a coherent summary of a large document collection. I present a hierarchical topic model for multi-document summarization that distinguishes central from auxiliary content. This model vastly improves summarization performance over other state-of-the-art approaches. Beyond giving a simple general summary, it can discover and summarize different aspects of a document collection's content.

A reception will be held at 3:40 PM in the atrium, outside the presentation room.