Recent Computer Science Ph.D. graduates (2013)

Recent Computer Science Ph.D. graduates

Marc-Allen Cartright; Query-Time Optimization Techniques for Structured Queries in Information Retrieval; (James Allan, Advisor); Sept. 2013; Software Engineer, Google, Inc.

Information retrieval (IR) systems are evolving towards larger, more complicated queries. From an operational perspective, larger queries translate into an increasing computational cost to respond to a query. This causes an increasing tension in the trade-off between retrieval effectiveness and efficiency. This tension creates a strong need for optimization techniques to improve the efficiency of ranking with respect to these more complex retrieval models. This thesis presents three new optimization techniques designed to deal with different aspects of structured queries: 1) manipulation of interpolated subqueries, a commonly used query structure; 2) development of an alternative scoring formulation to make retrieval models more responsive to dynamic pruning techniques; and 3) introduction of delayed execution, which focuses on the class of queries that utilize term dependencies and term conjunction operations. In each case we empirically show that these optimizations can significantly improve query processing efficiency without impacting retrieval effectiveness.

Shane Clark; The Security and Privacy Implications of Energy-Proportional Computing; (Kevin Fu, Advisor); Sept. 2013; Research Scientist, BBN Technologies

The parallel trends of greater energy-efficiency and more aggressive power management are yielding computers that inch closer to energy-proportional computing with every generation. Energy-proportional computing, in which power consumption scales closely with workload, has unintended side effects for security and privacy. This thesis demonstrates the potential for system-level power analysis--the inference of a computer's internal states based on power observation at the "plug." It also examines which hardware components and software workloads have the greatest impact on information leakage. This work identifies the potential for privacy violations by demonstrating that a malicious party could identify which webpage from a given corpus a user is viewing. It also identifies constructive applications for power analysis, evaluating its use as an anomaly detection mechanism for embedded devices. Finally, this thesis includes modeling work that correlates AC and DC power consumption to pinpoint which components contribute most to information leakage.

Henry Feild; Exploring Privacy and Personalization in Select Information Retrieval Applications; (James Allan, Advisor); Sept. 2013; Assistant Professor, Dept. of Computer Science, Endicott College

The goal of this work is to explore the effects of personalization and privacy preservation methods on three information retrieval applications, namely search task identification, task-aware query recommendation, and searcher frustration detection. We pursue this goal by first introducing a novel framework called CrowdLogging for logging and aggregating data privately over a distributed set of users. We then describe several privacy mechanisms for sanitizing global data, including one novel mechanism based on differential privacy. We present a template for describing how local user data and global aggregate data are collected, processed, and used within an application, and apply this template to our three applications. We then introduce an open source system called CrowdLogger that implements the CrowdLogging framework and also serves as a platform for conducting in-situ user studies of search behavior, prototyping and evaluating information retrieval applications, and collecting labeled data.

Jacqueline Feild; Improving Text Recognition in Images of Natural Scenes; (Erik Learned-Miller, Advisor); Feb. 2014; Data Scientist, McGraw-Hill Education

This thesis develops methods for improving scene text recognition. We do this by incorporating new types of information into models and by exploring how to compose simple components into highly effective systems. First, we introduce two techniques for character recognition. We describe a character recognition system that incorporates similarity information in a novel way and a new language model that models syllables in a word to produce word labels that can be pronounced in English. Next we look at word recognition and we develop a new technique for segmenting text for these images called bilateral regression segmentation. We also introduce an open-vocabulary word recognition system that uses a very large web-based lexicon to achieve state of the art recognition performance. Lastly, we remove the assumption that words have been located and describe an end-to-end system that detects and recognizes text in any natural scene image.

Samuel Huston; Indexing Proximity Based Dependencies for Information Retrieval; (W. Bruce Croft, Advisor); Feb. 2014; Software Engineer, Google, Inc.

Research into term dependencies for information retrieval has demonstrated that dependency retrieval models are able to consistently improve retrieval effectiveness over bag-of-words models. However, the computation of term dependency statistics is a major efficiency bottleneck in the execution of these retrieval models. Further, despite the large number of published comparisons between dependency models and bag-of-words approaches, there has been a lack of direct comparisons between alternate dependency models. This thesis investigates the problem of improving the efficiency of dependency retrieval models without compromising the effectiveness benefits of the term dependency features. The major contributions presented in this thesis includes a systematic comparison of ad-hoc dependency models; the presentation and analysis of two new index data structures for term dependency data; and finally a comparison of the efficiency of these data structure for the most effective term dependency models identified by the systematic comparison.

Phillip Kirlin; A Probabilistic Model of Hierarchical Music Analysis; (David Jensen, Advisor); Feb. 2014; Assistant Professor, Dept. of Mathematics and Computer Science, Rhodes College

Schenkerian music theory supposes that Western tonal compositions can be viewed as hierarchies of musical objects. The process of Schenkerian analysis reveals this hierarchy by identifying connections between notes or chords of a composition that illustrate both the small- and large-scale construction of the music. We present a new probabilistic model of this variety of music analysis, details of how the parameters of the model can be learned from a corpus, an algorithm for deriving the most probable analysis for a given piece of music, and both quantitative and human-based evaluations of the algorithm's performance. In addition, we describe the creation of the corpus, the first publicly available data set to contain both musical excerpts and corresponding computer-readable Schenkerian analyses. Combining this corpus with the probabilistic model gives us the first completely data-driven computational approach to hierarchical music analysis.

Chao Li; Optimizing Linear Queries Under Differential Privacy; (Gerome Miklau, Advisor); Sept. 2013; Software Engineer, Google, Inc.

Statistical data analysis on large collections of personal data can lead to fascinating results but also raises the privacy risk unwanted information disclosure. Differential privacy is a rigorous privacy definition that protects individuals' information during statistical data analysis. While it is straightforward to construct differentially private algorithms for many common tasks, methods to design error-optimal algorithms for most non-trivial tasks are still unknown. This thesis proposes the matrix mechanism, which answers sets of linear counting queries under differential privacy. Such queries cover the scope of many aggregation tasks, including count, sum and histogram. The thesis also contains an analysis of the matrix mechanism, including a closed-form error formulation and optimization programs to minimize the error. Further, it presents two algorithms. One based on the matrix mechanism and the other combining the matrix mechanism with other novel techniques, both of which answer various sets of queries with error lower than state-of-art algorithms.

Manjunath Narayana; Probabilistic Models for Motion Segmentation in Image Sequences; (Allen Hanson and Erik Learned-Miller, Advisors); Feb. 2014; Research Engineer, Metaio, Inc.

Motion segmentation, or labeling image pixels as moving or stationary, is an important task in computer vision. This thesis makes contributions towards segmentation with both stationary and moving cameras. For stationary cameras, we develop a probabilistic model that intuitively combines the various aspects of the problem in a system that is easy to interpret and to extend. For moving cameras, segmentation is commonly performed using the image plane motion of pixels, or optical flow. However, objects that are at different depths from the camera can exhibit different optical flows, causing a depth-dependent scene segmentation. We achieve a depth-independent segmentation that is consistent with real-world motion in the scene by using optical flow orientations instead of complete vectors. We propose a non-parametric probabilistic model that automatically estimates the number of motions and a rotation compensation algorithm that enables segmentation in a wide range of challenging hand-held camera videos

Scott Niekum; Semantically Grounded Learning from Unstructured Demonstrations; (Andrew Barto, Advisor); Sept. 2013; Postdoctoral Researcher, Carnegie Mellon University

Robots exhibit flexible behavior largely in proportion to their degree of semantic knowledge about the world.  Such knowledge is often meticulously hand-coded for a narrow class of tasks, limiting the scope of possible robot competencies. For this reason, learning from demonstration (LfD) has become a popular alternative to traditional robot programming methods, aiming to provide a natural mechanism for quickly teaching robots. Unfortunately, LfD often yields little semantic knowledge about the world, and thus lacks robust generalization capabilities, especially for complex, multi-step tasks. To address this shortcoming of LfD, we present a series of algorithms that automatically detect and leverage repeated structure at multiple levels of abstraction in demonstration data, providing critical insights into task invariants, features of importance, and high-level task structure.  This culminates in the discovery of semantically meaningful skills that are flexible and reusable, providing robust generalization and transfer in complex, multi-step robotic tasks.

James Partan; Characterization and Network Consequences of Low Spreading Loss in Underwater Acoustic Networks; (Brian Levine, Advisor); Sept. 2013; Research Engineer, Woods Hole Oceanographic Institution

The focus of this thesis is packet detection in interference in underwater acoustic wireless networks (UANs), and its role in the effectiveness of collision-avoidance medium-access control (MAC) protocols. Spreading-loss measures the decrease in received energy as a function of range, and determines the level of long-range interference.  We present a new spreading model, the mixed-exponent spreading model, for UAN nodes using a matched-filter detector as a low-power wakeup detector. Under this model, there are distinct spreading-loss exponents for packet detection and interference. We validate this spreading model numerically, and with measurements of the spreading exponents from shallow-water experimental data. Our results suggest caution for use of the poorly grounded, but widely used, standard spreading model.  We next analyze the effectiveness of collision-avoidance MAC protocols in UANs. The low spreading loss in UANs, in particular with the mixed-exponent spreading model, can lead to low collision-avoidance effectiveness compared with radio networks.

Ismet (Zeki) Yalniz; Efficient Representation and Matching of Texts and Images in Scanned Book Collections; (R. Manmatha, Advisor); Feb. 2014; Software Engineer, Amazon, A9.com

Several research problems are investigated over large collections of scanned books given their page images and corresponding optical character recognition (OCR) outputs. First, we propose general framework which can be used to efficiently align and compare the textual content of the books at various coarseness levels and even across languages. The framework uses the sequence of words which appear only once in the entire book ("the sequence of unique words'') to represent the text. This approach is used for aligning long noisy texts, detecting partial duplicates and translations of books, and, aligning texts written in different languages. In the second part, the global font feature along with the letter sequence information is used for facilitating and/or improving text search in noisy page images using visual features. The effectiveness is demonstrated for books printed in different scripts for which there is no OCR engine available or the recognition accuracy is low.