
Recent Computer Science Ph.D. graduates (May 2014)

Elif Aktolga; Integrating Non-Topical Aspects into Information Retrieval; (James Allan, Advisor); May 2014; Research Engineer, Apple
When users investigate a topic, they are often interested in results that are not just relevant but also strongly opinionated, or that cover a range of times. If the initial search results are not satisfactory, several reformulated queries often need to be issued. In this thesis, we focus on two non-topical dimensions: opinionatedness and time. To improve search results along non-topical dimensions, we use diversification approaches, diversifying results across one or more non-topical dimensions. The burden of analyzing pre-existing biases for a query and discovering the times at which important events happened is carried entirely by the system. We show how to combine several dimensions, each with its own bias, and present approaches to time and sentiment diversification. The insights from this work will be valuable for next-generation search engines and retrieval systems.
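
As a rough illustration of the diversification idea (not the thesis's actual models), the sketch below greedily re-ranks documents to balance relevance against coverage of a non-topical dimension such as sentiment; the inputs `docs`, `relevance`, and `sentiment` are hypothetical.

```python
# A minimal sketch of greedy re-ranking that diversifies results across a
# non-topical dimension (here, sentiment class). Illustrative only; the
# inputs below are hypothetical stand-ins for a real retrieval system's data.

from collections import Counter

def diversify(docs, relevance, sentiment, k=10, trade_off=0.5):
    """Greedily pick k docs, balancing relevance against sentiment coverage."""
    selected, counts = [], Counter()
    candidates = set(docs)
    while candidates and len(selected) < k:
        def score(d):
            # Penalize sentiment classes already well represented.
            redundancy = counts[sentiment[d]] / (len(selected) + 1)
            return trade_off * relevance[d] - (1 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        counts[sentiment[best]] += 1
        candidates.remove(best)
    return selected
```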

Jeff Dalton; Entity-based Enrichment for Information Extraction and Retrieval; (James Allan, Advisor); May 2014; Software Engineer, Google Inc.
The goal of this work is to leverage knowledge of the world to improve the understanding of queries and documents using entities. An entity is a thing or concept that exists in the world, such as a politician, a battle, a film, or a color. Entity-based enrichment (EBE) is a new expansion model for both queries and documents that uses features from similar entity mentions in the document collection and in external knowledge bases, such as Freebase and Wikipedia. With the ultimate goal of improving information retrieval effectiveness, we start from unstructured text and, through information extraction, build up rich entity-based representations linked to external knowledge resources. We study the application of entity-based enrichment to improve the effectiveness of each step in the pipeline: 1) named entity recognition, 2) entity linking, and 3) ad hoc document retrieval. The empirical results for EBE on each of these tasks show significant improvements.
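
The toy sketch below conveys the flavor of entity-based query expansion under strong simplifying assumptions: the miniature knowledge base, the `enrich_query` helper, and the weights are all hypothetical and stand in for the thesis's Freebase/Wikipedia-backed feature model.

```python
# A toy sketch of entity-based query expansion: terms from knowledge-base
# entries for entities linked in the query are added with a down-weight.
# The mini knowledge base and weights are illustrative assumptions only.

KB = {
    "lincoln": {"aliases": ["abraham lincoln"], "types": ["politician", "president"]},
    "gettysburg": {"aliases": ["battle of gettysburg"], "types": ["battle"]},
}

def enrich_query(query_terms, kb=KB, weight=0.3):
    """Return (term, weight) pairs: original terms plus entity-derived terms."""
    weighted = [(t, 1.0) for t in query_terms]
    seen = set(query_terms)
    for term in query_terms:
        entry = kb.get(term)
        if entry is None:
            continue  # no entity linked to this term
        for expansion in entry["aliases"] + entry["types"]:
            for token in expansion.split():
                if token not in seen:
                    seen.add(token)
                    weighted.append((token, weight))
    return weighted

print(enrich_query(["gettysburg", "speech"]))
```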

Van Dang; A Proportionality-based Approach to Search Result Diversification; (W. Bruce Croft, Advisor); May 2014; Software Engineer, Google Inc.
Search result diversification addresses the problem of queries with unclear information needs by providing a document ranking that covers multiple possible topics for a given query. This increases the likelihood that users will find documents relevant to their specific intent. This dissertation introduces a new perspective on diversity: diversity by proportionality. We consider a result list more diverse, with respect to a set of query topics, when the proportion of documents it provides for each topic more closely matches the topics' popularity distribution. From this definition, we derive a ranking framework that optimizes proportionality, along with a corresponding effectiveness measure. We also show that topical diversity can be achieved by diversifying search results over a set of terms that describe the query topics, which simplifies the task of finding a topic set to that of finding a term set. We present a technique, and several data sources, for generating these terms effectively.
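
One seat-allocation method in this proportional spirit is the Sainte-Laguë quotient; the simplified sketch below allocates ranking slots to topics with it. The thesis's actual framework is richer, and `relevance` and `topic_popularity` here are hypothetical inputs.

```python
# A simplified sketch of proportionally allocating ranking slots to query
# topics via the Sainte-Lague quotient. Not the dissertation's full
# framework; inputs are hypothetical.

def proportional_rank(docs, relevance, topic_popularity, k=10):
    """relevance[(doc, topic)] -> score; topic_popularity[topic] -> weight."""
    seats = {t: 0.0 for t in topic_popularity}
    ranking, remaining = [], set(docs)
    for _ in range(min(k, len(docs))):
        # Pick the topic most "under-served" relative to its popularity.
        topic = max(seats, key=lambda t: topic_popularity[t] / (2 * seats[t] + 1))
        best = max(remaining, key=lambda d: relevance[(d, topic)])
        ranking.append(best)
        remaining.remove(best)
        seats[topic] += 1
    return ranking
```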

Dan Gyllstrom; Making Networks Robust to Component Failures; (James Kurose, Advisor); May 2014; Senior Performance Engineer, Akamai Technologies
In this thesis, we consider instances of component failure in the Internet and in networked cyber-physical systems, such as the communication network used by the modern electric power grid (termed the smart grid). We design algorithms that make these networks more robust to various component failures, including failed routers, failures of links connecting routers, and failed sensors. The thesis is divided into three parts: recovery from malicious or misconfigured nodes injecting false information into a distributed system (e.g., the Internet), placement of smart grid sensors to provide measurement error detection, and fast recovery from link failures in a smart grid communication network.
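
As a generic illustration of planning for link failures (not the thesis's algorithms), this sketch precomputes, for every link on a primary path, a backup path that avoids it; the toy topology is an assumption.

```python
# A generic sketch of precomputing backup paths for single link failures:
# for each link on the primary path, find a shortest path avoiding it.
# Illustrates the problem setting only; the topology is a toy example.

from collections import deque

def bfs_path(adj, src, dst, banned=frozenset()):
    """Shortest path avoiding the (undirected) edges in `banned`, or None."""
    parent, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and frozenset((u, v)) not in banned:
                parent[v] = u
                queue.append(v)
    return None

def backup_paths(adj, src, dst):
    primary = bfs_path(adj, src, dst)
    edges = list(zip(primary, primary[1:]))
    return {e: bfs_path(adj, src, dst, banned={frozenset(e)}) for e in edges}

adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
print(backup_paths(adj, "a", "d"))
```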

Andrew Kae; Incorporating Boltzmann Machine Priors for Semantic Labeling in Images and Videos; (Erik Learned-Miller, Advisor); May 2014
Semantic labeling is the task of assigning category labels to regions in an image. For example, a scene may consist of regions corresponding to categories such as sky, water, and ground. Labeling regions allows us to better understand the scene itself as well as properties of the objects and their interactions within the scene. Typical approaches for this task include the conditional random field (CRF), which is well suited to modeling local interactions among adjacent image regions. However, the CRF may be limited in dealing with complex, global (long-range) interactions between regions in an image and between frames in a video. This thesis presents ways to extend the CRF framework by incorporating priors based on the restricted Boltzmann machine (RBM) to model long-range interactions within images and video, for use in semantic labeling.
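
For readers unfamiliar with the RBM building block, here is a minimal numpy sketch of its conditional distributions; the weights are random placeholders, and coupling this prior with a CRF, as the thesis does, is not shown.

```python
# A minimal numpy sketch of an RBM's conditional distributions, the building
# block used as a prior over label shapes. Weights are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 64, 16           # e.g., an 8x8 binary label patch
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v):
    """P(h_j = 1 | v) = sigmoid(c_j + v . W[:, j]); one Gibbs half-step."""
    p = sigmoid(c + v @ W)
    return (rng.random(n_hidden) < p).astype(float), p

def sample_visible(h):
    """P(v_i = 1 | h) = sigmoid(b_i + W[i, :] . h)."""
    p = sigmoid(b + W @ h)
    return (rng.random(n_visible) < p).astype(float), p

v0 = rng.integers(0, 2, n_visible).astype(float)
h, _ = sample_hidden(v0)
v1, _ = sample_visible(h)   # one reconstruction step
```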

Tongping Liu; Reliable and Efficient Multithreading; (Emery Berger, Advisor); May 2014; Assistant Professor, Univ. of Texas at San Antonio
To take advantage of multiple cores, software needs to be written using multithreading, yet writing multithreaded programs correctly and efficiently is notoriously far more challenging than writing sequential ones. I developed systems to combat both concurrency errors and performance issues in multithreaded programs. I developed Dthreads, a deterministic threading library that automatically ensures deterministic execution for unmodified C/C++ applications, without requiring programmer intervention or hardware support. Dthreads often matches or even exceeds the performance of standard thread libraries, making deterministic multithreading a practical alternative for the first time. I developed two other systems to attack false sharing, a performance issue that arises when multiple threads simultaneously access distinct parts of the same cache line. The first, Predator, not only precisely identifies false sharing but also predicts potential false sharing that has not yet manifested. The second, Sheriff-Protect, automatically eliminates false sharing inside parallel applications without programmer intervention.
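
A heavily simplified sketch of the core check behind false sharing detection follows: flag cache lines that multiple threads write at multiple offsets. Predator instruments real programs and also predicts unmanifested cases; this trace-based toy only illustrates the idea.

```python
# A heavily simplified, trace-based illustration of detecting false sharing:
# flag cache lines written by more than one thread at more than one offset.
# Real detectors like Predator work on instrumented binaries; this is a toy.

from collections import defaultdict

CACHE_LINE = 64  # bytes; typical on x86

def find_false_sharing(trace):
    """trace: iterable of (thread_id, address) write records."""
    lines = defaultdict(set)  # cache line -> {(thread, offset), ...}
    for tid, addr in trace:
        lines[addr // CACHE_LINE].add((tid, addr % CACHE_LINE))
    suspects = []
    for line, accesses in lines.items():
        threads = {t for t, _ in accesses}
        offsets = {o for _, o in accesses}
        if len(threads) > 1 and len(offsets) > 1:  # shared line, distinct words
            suspects.append(line)
    return suspects

# Two threads writing adjacent counters in the same 64-byte line:
print(find_false_sharing([(1, 0x1000), (2, 0x1008), (1, 0x1000)]))
```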

Marwan Mattar; Unsupervised Joint Alignment, Clustering and Feature Learning; (Allen Hanson and Erik Learned-Miller, Advisors); May 2014; Research Data Scientist, Electronic Arts
Joint alignment is the process of transforming instances in a data set to make them more similar based on a pre-defined measure of joint similarity. This process has great utility in many scientific disciplines, including radiology, psychology, and vision. This thesis takes steps toward developing an unsupervised data processing pipeline that includes alignment, clustering, and feature learning. We first present an efficient curve alignment algorithm that is effective on many synthetic and real data sets. We show that the byproducts of joint alignment, namely the aligned data and the transformation parameters, can dramatically improve classification performance. We then incorporate unsupervised feature learning, based on convolutional restricted Boltzmann machines, to learn a representation that is tuned to the statistics of the data set, and show how these features can improve both alignment quality and classification performance. Finally, we present a nonparametric Bayesian joint alignment and clustering model that handles data sets arising from multiple modes.
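
As an illustration of joint alignment in general (not the thesis's algorithm or transformation family), this congealing-style sketch repeatedly shifts each 1-D curve toward the evolving mean of the set.

```python
# A congealing-style sketch of joint alignment for 1-D curves: repeatedly
# shift each curve toward the evolving mean. Illustrative only; restricted
# to integer circular shifts for simplicity.

import numpy as np

def joint_align(curves, max_shift=5, n_iters=10):
    """curves: (n, length) array; returns aligned copies and integer shifts."""
    aligned = curves.copy()
    shifts = np.zeros(len(curves), dtype=int)
    for _ in range(n_iters):
        mean = aligned.mean(axis=0)
        for i, curve in enumerate(curves):
            # Try every shift in a window; keep the best fit to the mean.
            candidates = list(range(-max_shift, max_shift + 1))
            errs = [np.sum((np.roll(curve, s) - mean) ** 2) for s in candidates]
            shifts[i] = candidates[int(np.argmin(errs))]
            aligned[i] = np.roll(curve, shifts[i])
    return aligned, shifts
```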

Sameer Singh; Scaling MCMC Inference and Belief Propagation to Large, Dense Graphical Models; (Andrew McCallum, Advisor); May 2014; Postdoctoral Research Associate, Department of Computer Science, Univ. of Washington
In the past decade, single-core CPUs have given way to multi-core and distributed computing platforms. At the same time, access to large data collections is becoming progressively commonplace. Inference for probabilistic graphical models, which has traditionally been designed to operate sequentially, seems destined to become obsolete in this world of multi-core, multi-node systems. Further, modeling large datasets leads to an escalation in the number of variables, factors, and domains, and in the density of the models, all of which substantially increase the computational complexity of inference. Motivated by the need to scale inference to large, dense graphical models, this thesis explores approximations to Markov chain Monte Carlo (MCMC) and belief propagation (BP) that induce dynamic sparsity in the model in order to exploit parallelism. These tools for inference enable us to tackle relation extraction, entity resolution, cross-document coreference, and other information extraction tasks over large text corpora.
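
The locality that such approximations exploit can be seen in a minimal Gibbs sampler for a grid-shaped binary MRF: each update consults only the factors touching one variable. The distributed and sparsification machinery of the thesis is not shown, and the coupling constant is an arbitrary choice.

```python
# A minimal Gibbs sampler for a grid-shaped binary (+/-1) MRF. Each update
# reads only a variable's immediate neighbors, the locality that approximate,
# parallel MCMC schemes exploit. Illustrative only.

import numpy as np

def gibbs_sweep(state, coupling=0.5, rng=np.random.default_rng(0)):
    """One in-place sweep over a +/-1 grid under an Ising-style model."""
    n, m = state.shape
    for i in range(n):
        for j in range(m):
            # Local field: only the factors touching (i, j) are consulted.
            field = sum(state[x, y]
                        for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= x < n and 0 <= y < m)
            p_up = 1.0 / (1.0 + np.exp(-2.0 * coupling * field))
            state[i, j] = 1 if rng.random() < p_up else -1
    return state

state = np.where(np.random.default_rng(1).random((8, 8)) < 0.5, -1, 1)
for _ in range(100):
    gibbs_sweep(state)
```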