Putting Words Together: Crowdsourcing Data Collection for Lexical Similarity and Topical Coherence

Tuesday, 10/12/2010 11:00am to 12:00pm

Jordan Boyd-Graber
University of Maryland

Computer Science Building, Room 151

Faculty Host: Hannah Wallach

Getting quick and cheap data from humans is becoming easier with web-based crowdsourcing tools like Amazon Mechanical Turk. In this talk, I discuss two research projects that use Amazon Mechanical Turk to collect large amounts of data.

First, I discuss a procedure for collecting empirical, human-based judgments of how semantically similar concepts are. We call this measurement "evocation," compare it with other similarity measures, and use it to improve assistive devices for people suffering from aphasia, a debilitating neurological disorder.

Second, I discuss techniques for building human-centered measurements of the quality of topic models (e.g., document-centric models such as latent Dirichlet allocation and probabilistic latent semantic indexing, popular in information retrieval). After performing large-scale evaluations of these models, we found that, surprisingly, human judgments of quality do not necessarily correlate with traditional evaluation metrics such as held-out likelihood. (This work received a Best Student Paper Honorable Mention at NIPS 2009.)

If time permits, I will also discuss projects where crowdsourcing was not a viable means of collecting data: determining how transitive a verb is, and determining whether a piece of text is persuasive.


Jordan Boyd-Graber is an assistant professor in the College of Information Studies at the University of Maryland. Previously, he worked as a postdoc with Philip Resnik at the University of Maryland. Until 2009, he was a graduate student at Princeton University, working with David Blei on linguistic extensions of topic models.