Recent Ph.D. graduates: Where are they now?

The following students graduated with Ph.D.s from UMass Amherst Computer Science during the 2007-08 academic year:

Ron Bekkerman: Combinatorial Markov Random Fields and their Applications to Information Organization (James Allan, Advisor); Feb. 2008; Research Scientist, HP Labs.

We propose a new type of undirected graphical model called a Combinatorial Markov Random Field (Comraf) and discuss its advantages over existing graphical models. We develop an efficient inference methodology for Comrafs based on combinatorial optimization of information-theoretic objective functions; both global and local optimization schemes are discussed. We apply Comrafs to multi-modal clustering tasks: standard (unsupervised) clustering, semi-supervised clustering, interactive clustering, and one-class clustering. For the one-class clustering task, we analytically show that the proposed optimization method is optimal under certain simplifying assumptions. We empirically demonstrate the power of Comraf models by comparing them to other state-of-the-art machine learning techniques in both text and image clustering domains. For unsupervised clustering, we show that Comrafs consistently and significantly outperform three previous state-of-the-art clustering techniques on six real-world textual datasets. For semi-supervised clustering, we show that the Comraf model is superior to a well-known constrained optimization method. For interactive clustering, Comraf obtains higher accuracy than a Support Vector Machine trained on a large amount of labeled data. For one-class clustering, Comrafs demonstrate superior performance over two previously proposed methods. We conclude the thesis with a comprehensive recipe for machine learning modeling with Comrafs.

Aron Culotta: Learning and Inference in Weighted Logic with Application to Natural Language Processing (Andrew McCallum, Advisor); May 2008; Assistant Professor, Southeastern Louisiana University.

Over the past two decades, statistical machine learning approaches to natural language processing have largely replaced earlier logic-based systems. These probabilistic methods have proven to be well-suited to the ambiguity inherent in human communication. However, the shift to statistical modeling has mostly abandoned the representational advantages of logic-based approaches. For example, many language processing problems can be more meaningfully expressed in first-order logic rather than propositional logic. Unfortunately, most machine learning algorithms have been developed for propositional knowledge representations.

In recent years, there have been a number of attempts to combine logical and probabilistic approaches to artificial intelligence. However, their impact on real-world applications has been limited by serious scalability issues that arise when algorithms designed for propositional representations are applied to first-order logic representations. In this thesis, we explore approximate learning and inference algorithms tailored to higher-order representations and demonstrate that this synthesis of probability and logic can significantly improve the accuracy of several language processing systems.

Peter Desnoyers: Distributed Sensing Networks: Archiving, Indexing, and Querying (Prashant Shenoy, Advisor); Feb. 2008; Assistant Professor, Northeastern University.

This thesis advocates data-aware and storage-centric approaches to data handling in distributed data collection environments. Raw data is stored locally, and only selected, relevant information reaches the application. We apply two strategies: (i) archiving, indexing and querying to retrieve discrete portions of data, and (ii) mathematical modeling to extract relationships spread diffusely across the data. We propose mechanisms for these strategies and evaluate them in appropriate system contexts.

First we address storage and indexing of high-speed event data. We present a disk-based storage system that guarantees high write rates and a signature-file-based index that indexes high-speed data in real time. We evaluate a network monitor built on these mechanisms.

Next, we examine indexing and retrieval in low-power wireless sensor networks. Sensors send data summaries to more powerful proxy nodes, which use a novel index structure, the Interval Skip Graph, representing multiple records by a single imprecise key. We present a prototype sensor network storage system using these mechanisms.
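
The idea behind an imprecise key can be sketched in a few lines (an illustrative toy with hypothetical proxy names, not the thesis's skip-graph structure, which distributes the index across proxies):

```python
class IntervalIndex:
    """Toy stand-in for the Interval Skip Graph: each entry is an
    imprecise key (lo, hi) summarizing many sensor records held at
    one proxy, so the index stays small."""
    def __init__(self):
        self.entries = []  # list of (lo, hi, proxy_id)

    def insert(self, lo, hi, proxy_id):
        self.entries.append((lo, hi, proxy_id))
        self.entries.sort()  # the real structure maintains skip-graph order

    def query(self, q_lo, q_hi):
        # Return proxies whose summarized range overlaps [q_lo, q_hi];
        # only these nodes need to be asked for raw sensor data.
        return [pid for lo, hi, pid in self.entries
                if lo <= q_hi and q_lo <= hi]

idx = IntervalIndex()
idx.insert(10, 25, "proxy-A")   # e.g., temperatures 10..25 seen at proxy A
idx.insert(30, 42, "proxy-B")
idx.insert(20, 35, "proxy-C")
print(idx.query(26, 31))        # only B and C can hold matching records
```

A query thus pays for precision only on the few proxies whose summaries overlap it, which is the point of indexing imprecise keys at the resource-rich tier.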

Lastly we address analysis of feature-rich but poorly structured events. Statistical machine learning techniques create models of application behavior in a data center using automated feature identification to identify model inputs within the raw data. We present and evaluate Modellus, a monitoring system using these mechanisms.

Fernando Diaz: Regularizing Query-Based Retrieval Scores (James Allan, Advisor); Feb. 2008; Research Scientist, Yahoo!, Montreal.

Query-based information retrieval refers to the process of scoring documents given a short natural language query. Query-based information retrieval systems have been developed to support searching diverse collections such as the world wide web, personal email archives, news corpora, and legal collections. This thesis is motivated by one of the tenets of information retrieval: the cluster hypothesis. We define a design principle based on the cluster hypothesis which states that retrieval scores should be locally consistent. We refer to this design principle as score autocorrelation. Our experiments show that the degree to which retrieval scores satisfy this design principle correlates positively with system performance. We use this result to define a general, black box method for improving the local consistency of a set of retrieval scores. We refer to this process as local score regularization. We demonstrate that regularization consistently and significantly improves retrieval performance for a wide variety of baseline algorithms. Regularization is closely related to classic techniques such as pseudo-relevance feedback and cluster-based retrieval. We demonstrate that the effectiveness of these techniques may be explained by their regularizing behavior. We argue that regularization should be adopted either as a generic post-processing step or as a fundamental design principle for retrieval models.
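
The flavor of local score regularization can be sketched as diffusion over a document-affinity graph (an illustrative simplification with made-up numbers; the thesis formulates regularization more carefully):

```python
import numpy as np

def regularize_scores(scores, affinity, alpha=0.5, iters=50):
    """Smooth retrieval scores over a document-affinity graph so that
    similar documents receive similar scores. alpha trades fidelity to
    the original scores against local consistency."""
    W = affinity / affinity.sum(axis=1, keepdims=True)  # row-normalize
    f = scores.copy()
    for _ in range(iters):
        f = alpha * W @ f + (1 - alpha) * scores
    return f

# Three documents; doc 0 and doc 1 are near-duplicates, doc 2 unrelated.
affinity = np.array([[1.0, 0.9, 0.1],
                     [0.9, 1.0, 0.1],
                     [0.1, 0.1, 1.0]])
scores = np.array([1.0, 0.2, 0.9])   # doc 1 scored low despite its twin
smoothed = regularize_scores(scores, affinity)
print(smoothed)  # doc 1 is pulled up toward doc 0's score
```

The post-processing nature of the method is visible here: it needs only the scores and pairwise similarities, not the internals of the retrieval model that produced them.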

Shaolei Feng: Statistical Models for Text Query-Based Image Retrieval (R. Manmatha, Advisor); May 2008; Research Scientist, Siemens Corporate Research.

Image indexing and retrieval is a challenging problem. Traditional content-based approaches make use of queries based on image examples or image attributes like color and texture. However, they do not capture the semantics or meanings of images. Generally, libraries and other organizations manually annotate each image with keywords and captions. The disadvantage of this approach is the huge cost of annotating a large number of images and the inconsistency of annotations by different people. In this work, we focus on general image and historical handwritten document retrieval based on textual queries. We explore three statistical model-based techniques that allow us to retrieve general images and historical handwritten document images with text queries: (i) image retrieval based on automatic annotation, (ii) direct retrieval based on computing the posterior of an image given a text query, and (iii) handwritten document image recognition.

The contributions of this work include (i) two probabilistic generative models for annotation-based retrieval, (ii) a direct retrieval model for general images, and (iii) an investigation of machine learning models for handwritten document recognition. Our experimental results show that our approaches may be applied to practical textual query-based retrieval systems on large image data sets.

Jiwoon Jeon: Searching Question and Answer Archives (W. Bruce Croft, Advisor); Sept. 2007; Software Engineer, Google Inc.

Archives of questions and answers are a valuable information source. However, little research has been done to exploit them. We propose a new type of information retrieval system that answers users' questions by searching question and answer archives. The proposed system has many advantages over current web search engines. In this system, natural language questions are used instead of keyword queries, and the system directly returns answers instead of lists of documents. The two most important challenges in implementing the system are finding questions semantically similar to the user's question and estimating the quality of answers. We propose using a translation-based retrieval model to overcome the word mismatch problem between questions. Our model combines the advantages of the IBM machine translation model and the query likelihood language model and shows significantly improved retrieval performance over state-of-the-art retrieval models. We also show that collections of question and answer pairs are good linguistic resources for learning reliable word-to-word translation relationships. To avoid returning bad answers to users, we build an answer quality predictor based on statistical machine learning techniques. By combining the quality predictor with the translation-based retrieval model, our system successfully returns relevant, high-quality answers to the user.

Giridhar Kumaran: Efficient User Interaction In Information Retrieval (James Allan, Advisor); May 2008; Scientist, Microsoft Live Labs.

We present new ways of interacting with a user based on query analysis and reformulation. Our goal is to improve retrieval performance and help users understand the retrieval process and collection searched. We do this by showing users the potential impact their decisions will have on the retrieval process, so they can make more informed choices from the options presented to them.

Unlike previous work in user interaction where a one-procedure-fits-all strategy was pursued, we invoke user interaction only when there is potential for improvement because tedious interaction can have an unfavorable impact on user experience. We present techniques for selective interaction and show their utility in the context of two interaction techniques we have developed. Our results show that user interaction can be avoided in most cases without much deterioration in performance.

User interaction can be made more productive by providing users with an optimally-sized set of high quality options. We present efficient techniques to determine such a set. When faced with a decision to interact with a user given a query, we determine the best interaction technique for that query by obtaining implicit feedback from the user. By utilizing these interaction-related techniques, we show through simulations and user studies that users can obtain better performance with less effort.

Audrey Lee: Geometric Constraint Systems with Applications in CAD and Biology (Ileana Streinu, Advisor); May 2008; Visiting Professor, Department of Computer Science, Mount Holyoke College.

Motivated by applications in Computer Aided Design (CAD) and biology, we investigate geometric constraint systems, composed of atomic elements with constraints between them. A well-studied model in rigidity theory is the bar-and-joint structure, where the atomic elements are universal joints connected by fixed-length bar constraints. We propose several new models involving constraints arising in CAD and biology and provide the theoretical foundation for each. In particular, we present a model addressing the pairwise constraints among points, lines and planes found in constraint-based CAD software, such as in the assembly environment of the widely-used SolidWorks CAD application.

As a result, we identify and generalize combinatorial properties that appear as necessary conditions for generic rigidity of these new constraint systems; in some cases, the conditions are also sufficient, thus providing a complete characterization. We study sparsity for graphs arising from known rigidity results and present extensions of this concept to graded, mixed and nested sparsity. For these sparsity properties, we present algorithms that solve the fundamental questions of Decision, Extraction and Components, based on the efficient and elegant pebble games first developed for planar bar-and-joint rigidity.
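
The basic sparsity counting condition is easy to state in code; the following brute-force check (fine for tiny graphs, in contrast to the polynomial-time pebble games) illustrates the (2,3) case:

```python
from itertools import combinations

def is_sparse(vertices, edges, k=2, l=3):
    """Brute-force (k, l)-sparsity check: every subset of n' >= 2
    vertices spans at most k*n' - l edges. By Laman's theorem, the
    (2,3)-sparse graphs with exactly 2n - 3 edges are the generically
    minimally rigid bar-and-joint graphs in the plane; the pebble
    games decide this in polynomial time."""
    for r in range(2, len(vertices) + 1):
        for sub in combinations(vertices, r):
            s = set(sub)
            span = sum(1 for u, v in edges if u in s and v in s)
            if span > k * r - l:
                return False
    return True

# K4 has 6 edges on 4 vertices: 6 > 2*4 - 3, so it is not (2,3)-sparse.
print(is_sparse(range(4), [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3)]))  # False
# A triangle is tight: 3 edges = 2*3 - 3.
print(is_sparse(range(3), [(0,1),(1,2),(0,2)]))  # True
```

The graded, mixed, and nested variants studied in the thesis generalize exactly this kind of counting condition to the richer CAD and biology constraint models.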

Wei Li: Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations (Andrew McCallum, Advisor); Sept. 2007; Senior Software Development Engineer, Yahoo! Inc.

Statistical topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, the majority of existing approaches capture no or limited correlations between topics. We propose the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). We present various structures within this framework, different parameterizations of topic distributions, and an extension to capture dynamic patterns of topic correlations. We also introduce a nonparametric Bayesian prior to automatically learn the topic structure from data. The model is evaluated on document classification, likelihood of held-out data, the ability to support fine-grained topics, and topical keyword coherence. With a highly-scalable approximation, PAM has also been applied to discover topic hierarchies in very large datasets.

Marc Liberatore: Low-Latency Anonymity Systems: Statistical Attacks and New Applications (Brian Levine, Advisor); Feb. 2008; Mellon CTW Postdoctoral Fellow/Visiting Faculty, Department of Math and Computer Science, Wesleyan University.

In this dissertation, we study low-latency anonymity protocols and systems. Such systems enable anonymous communication where latency is not tolerated well, such as browsing the web, but also introduce new vulnerabilities not present in systems that hide timing information. We examine one such vulnerability, the profiling attack, as well as possible defenses. We also examine the feasibility of using low-latency anonymity techniques to support a new application, Voice over IP (VoIP). First, we show that profiling attacks on low-latency anonymity systems are feasible. The attack is based upon pre-constructing profiles of communication, and using them to identify the sender of encrypted, anonymized traffic. Second, we present results from a large-scale measurement study, which indicate that profiling is practical across sets of thousands of possible senders, and that such profiles remain valid for weeks at a time. Third, we evaluate defenses against the profiling attack and their effects upon system performance. We then demonstrate the feasibility of supporting anonymous VoIP; specifically, we show supporting measurement data and outline the changes current anonymity systems would require. We also show how such systems are potentially more vulnerable to known attacks, and examine the tradeoffs between VoIP performance and anonymity inherent in such systems.
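
The core of a profiling attack can be sketched with hypothetical packet-size data (packet sizes survive encryption even when payloads do not; the dissertation's profiles and matching are more sophisticated than this naive-Bayes toy):

```python
import math
from collections import Counter

def build_profile(packet_sizes, alpha=1.0):
    """A sender's profile: a smoothed distribution over observed
    packet sizes, pre-constructed from unanonymized traffic."""
    vocab = set(packet_sizes)
    counts = Counter(packet_sizes)
    total = len(packet_sizes) + alpha * len(vocab)
    return {s: (counts[s] + alpha) / total for s in vocab}

def identify(observed, profiles):
    """Attribute an anonymized trace to the profile under which it is
    most likely (naive-Bayes-style matching)."""
    def loglik(profile):
        floor = min(profile.values()) / 10  # penalty for unseen sizes
        return sum(math.log(profile.get(s, floor)) for s in observed)
    return max(profiles, key=lambda name: loglik(profiles[name]))

profiles = {
    "alice": build_profile([1500, 1500, 576, 1500, 40]),
    "bob":   build_profile([40, 40, 40, 576, 100]),
}
print(identify([1500, 1500, 40], profiles))  # matches alice's profile
```

Defenses evaluated in the dissertation, such as padding, work precisely by flattening the feature distributions this matching relies on.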

Junning Liu: On Joint Coding and Combinatorial Optimization in Communication Networks (Donald Towsley and Micah Adler, Advisors); Feb. 2008; R&D Software Engineer, YouTube Analytics, a subsidiary of Google.

Information flows in communication networks are treated as commodity flows. Network coding is a technique that allows nodes to manipulate the information content arbitrarily inside the network. This work focuses on joint coding/routing optimization for throughput, energy consumption, and robustness of wired and wireless networks. We study how to minimize the communication cost of collecting correlated data at a sink, how to maximize the throughput of multi-pair independent unicast traffic, and how to maximize data utility in random, unreliable data collection networks. We introduce "distance entropy", which characterizes the spatial distribution of information at networked sources; we show that it is a meaningful metric as the tight lower bound on the minimum communication cost of collecting the sources at a single sink. We propose a practical coding/routing algorithm called Hierarchical Difference Broadcasting (HDB). We also revisit Gupta & Kumar's work on the capacity of wireless ad hoc networks and consider it in a setting that allows arbitrary coding at the nodes. We characterize the order of the throughput capacity and show that coding combined with wireless broadcasting provides no order improvement. Finally, we apply coding to improve network robustness and analyze a joint coding and scheduling problem.

Xiaoyong Liu: Cluster-Based Retrieval from the Language Modeling Perspective (W. Bruce Croft, Advisor); Feb. 2008; Software Design Engineer V, Hewlett-Packard.

It is generally assumed that the relevance of documents can be assessed independently: the fact that one document is relevant does not contribute to predicting the relevance of a closely-related document. In contrast, cluster-based retrieval assumes that the probability of relevance of a document should depend on the relevance of other similar documents to the same query.

The most common approach to cluster-based retrieval is to retrieve one or more clusters in their entirety. Research in this area suggests that optimal clusters could yield very large improvements in effectiveness relative to document retrieval. However, no retrieval strategy has achieved this result; document retrieval is generally more effective. One area of recent research is to use clusters as a form of document smoothing.

We study how to retrieve the best group of documents from the language-modeling perspective, considering both cluster smoothing and cluster retrieval. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize good document clusters, and develop new probabilistic representations that capture the identified features. An extensive empirical evaluation is provided. We argue that an IR system's ability to exploit good document clusters depends on how they are represented.
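
Cluster smoothing itself is a simple interpolation; a minimal sketch with illustrative weights (not tuned values from the thesis):

```python
def cluster_smoothed_prob(w, doc, cluster, collection, lam=(0.6, 0.3, 0.1)):
    """Cluster-smoothing sketch: interpolate the document, cluster, and
    collection language models, so a document borrows probability mass
    for words its cluster-mates contain but it does not."""
    l_d, l_cl, l_c = lam
    return (l_d * doc.count(w) / len(doc)
            + l_cl * cluster.count(w) / len(cluster)
            + l_c * collection.count(w) / len(collection))

doc = "stock prices fell".split()
cluster = "stock prices fell market shares dropped trading halted".split()
collection = cluster + "rain fell on the game".split()

# "market" never occurs in the document, but its cluster supplies it.
p = cluster_smoothed_prob("market", doc, cluster, collection)
print(p > 0)  # nonzero probability despite a zero document count
```

How much such smoothing helps clearly hinges on what stands in for `cluster`, which is the representation question the thesis takes up.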

Donald Metzler: Effectively Modeling Term Dependencies in Information Retrieval (W. Bruce Croft, Advisor); Sept. 2007; Research Scientist, Yahoo! Research.

Current state-of-the-art information retrieval models treat documents and queries as bags of words. There have been many attempts to go beyond this simple representation. Unfortunately, few have shown consistent improvements in retrieval effectiveness across a wide range of tasks and data sets. Here, we propose a new statistical model for information retrieval based on Markov random fields. The proposed model goes beyond the bag of words assumption by allowing dependencies between terms to be incorporated into the model. This allows for a variety of textual and non-textual features to be easily combined under the umbrella of a single model. Within this framework, we explore the theoretical issues involved, parameter estimation, feature selection, and query expansion. We give experimental results from a number of information retrieval tasks, such as ad hoc retrieval and web search.
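
One commonly cited instantiation of such an MRF combines single-term, exact-phrase, and windowed-co-occurrence potentials; the sketch below uses illustrative default weights and add-one smoothing rather than anything tuned from the thesis:

```python
import math

def sdm_score(query_terms, doc, weights=(0.85, 0.1, 0.05), window=8):
    """Sketch of a sequential-dependence-style MRF score: a weighted
    sum of log-potentials from single terms, exact adjacent pairs, and
    adjacent pairs co-occurring within a small window."""
    n = len(doc)
    def unigram(t):
        return math.log((doc.count(t) + 1) / (n + 1))
    def ordered(a, b):
        hits = sum(1 for i in range(n - 1) if doc[i] == a and doc[i+1] == b)
        return math.log((hits + 1) / (n + 1))
    def unordered(a, b):
        hits = sum(1 for i in range(n)
                   for j in range(i + 1, min(i + window, n))
                   if {doc[i], doc[j]} == {a, b})
        return math.log((hits + 1) / (n + 1))
    lam_t, lam_o, lam_u = weights
    s = sum(lam_t * unigram(t) for t in query_terms)
    for a, b in zip(query_terms, query_terms[1:]):
        s += lam_o * ordered(a, b) + lam_u * unordered(a, b)
    return s

doc_phrase = "new york is a big city in new york state".split()
doc_bag = "york was new but the city kept its new look york".split()
q = ["new", "york"]
print(sdm_score(q, doc_phrase) > sdm_score(q, doc_bag))  # phrase wins
```

Both documents contain the same query-term counts; only the dependency features separate them, which is exactly what a bag-of-words model cannot do.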

Trevor Strohman: Efficient Processing of Complex Features for Information Retrieval (W. Bruce Croft, Advisor); Feb. 2008; Software Engineer, Google Inc.

Research on text search systems has primarily relied on simple occurrences of query terms within documents to compute document relevance scores. However, recent research shows that additional document features are crucial for improving retrieval effectiveness. We develop a series of techniques for efficiently processing queries with feature-based models. Our TupleFlow framework, an extension of MapReduce, provides a basis for custom binned indexes, which store feature data efficiently. Our work on binning probabilities shows how to map language model probabilities into the space of small positive integers, improving query speed without reducing effectiveness. We also present new efficient query processing results for both document-sorted and score-sorted indexes. All of our work is evaluated on the largest available research dataset.
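
The binning idea can be sketched as uniform quantization in log space (an illustrative scheme with arbitrary bin count and range, not the thesis's exact mapping):

```python
import math

def bin_logprob(p, n_bins=64, lo=-20.0, hi=0.0):
    """Quantize a language-model log-probability into a small positive
    integer: the index stores a compact integer per posting, and query
    processing can accumulate scores with integer arithmetic."""
    x = max(lo, min(hi, math.log(p)))
    return round((x - lo) / (hi - lo) * (n_bins - 1))

def unbin(b, n_bins=64, lo=-20.0, hi=0.0):
    """Map a bin number back to a representative log-probability."""
    return lo + b / (n_bins - 1) * (hi - lo)

p = 3.2e-4
b = bin_logprob(p)           # a 6-bit integer stored in the index
approx = math.exp(unbin(b))  # probability recovered at query time
print(b, approx)             # close to p despite only 64 levels
```

Because ranking only needs the relative order of scores, a surprisingly coarse quantization of this kind can leave effectiveness essentially untouched while shrinking the index and speeding up scoring.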

Kyoungwon Suh: Monitoring, Measurement, and Control of Multimedia Traffic in IP Networks (James F. Kurose and Donald F. Towsley, Advisors); Sept. 2007; Assistant Professor, Illinois State University at Normal.

We propose several architectural components that monitor and measure multimedia traffic at the edge and in the core of IP networks and that serve stored multimedia content to clients in edge networks. We first present a technique to detect Skype-relayed traffic from passively measured packet traces collected at the edge of the Internet. We propose several flow-level metrics that characterize the nature of relayed traffic and that, together with results from an experimental characterization of Skype-relayed traffic, are used to detect it. We next consider the problem of optimally placing monitors within the network and choosing their sampling rates. We formulate several minimum-cost, maximum-coverage problems under various budget constraints, show that they are NP-hard, and propose and evaluate greedy heuristics. Last, we introduce and evaluate the Push-to-Peer architecture for streaming video among cooperating nodes in an edge network, which can drastically reduce the load placed on the core and access links of the network and on the streaming servers. The main departure from previous designs is that content is proactively pushed to peers and persistently stored before the actual peer-to-peer transfers.
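
The greedy heuristic for such NP-hard placement problems follows the classic budgeted max-coverage pattern; a sketch with hypothetical router locations and flow sets (the thesis's formulation also chooses sampling rates):

```python
def greedy_placement(coverage, costs, budget):
    """Greedy heuristic for budgeted monitor placement: repeatedly pick
    the affordable location that covers the most still-uncovered flows."""
    covered, chosen, spent = set(), [], 0
    while True:
        best, best_gain = None, 0
        for loc, flows in coverage.items():
            if loc in chosen or spent + costs[loc] > budget:
                continue
            gain = len(flows - covered)   # newly covered flows
            if gain > best_gain:
                best, best_gain = loc, gain
        if best is None:
            return chosen, covered
        chosen.append(best)
        covered |= coverage[best]
        spent += costs[best]

# Hypothetical routers r1..r3, each observing a set of flow IDs.
coverage = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {5, 6, 7, 8}}
costs = {"r1": 2, "r2": 1, "r3": 3}
print(greedy_placement(coverage, costs, budget=5))
```

For plain (unit-cost) max coverage this greedy strategy carries the well-known (1 - 1/e) approximation guarantee, which is the standard justification for heuristics of this shape.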

Charles Sutton: Local Training for Conditional Random Fields (Andrew McCallum, Advisor); Feb. 2008; Postdoctoral Researcher, Department of Electrical Engineering and Computer Science, University of California Berkeley.

Many applications require predicting multiple interdependent variables. Recent attention has therefore focused on structured prediction methods, which combine the modeling flexibility of graphical models with the complex, dependent features of traditional classification methods. Especially popular have been conditional random fields (CRFs): graphical models of the conditional distribution over outputs given a set of observed features. Unfortunately, parameter estimation in CRFs requires repeated inference, which can be computationally expensive. Complex graphical structures are increasingly desired in practical applications, but then training time often becomes prohibitive.

I investigate efficient training methods for CRFs with complex graphical structure, focusing on local methods that avoid propagating information globally. First, I investigate piecewise training, which trains each of a model's factors separately. I present three views of piecewise training: as maximizing the likelihood in a so-called "node-split graph", as maximizing the Bethe likelihood with uniform messages, and as generalizing the pseudo-moment matching estimator of Wainwright et al. (2003). Second, I propose piecewise pseudolikelihood, a hybrid procedure that "pseudolikelihood-izes" the piecewise likelihood and is more efficient when the variables have large cardinality. Piecewise pseudolikelihood performs well even on applications in which standard pseudolikelihood performs poorly. Finally, motivated by the connection between piecewise training and belief propagation (BP), I explore training methods that use beliefs arising from stopping BP before convergence.

Xing Wei: Topic Models in Information Retrieval (W. Bruce Croft, Advisor); Sept. 2007; Research Scientist, Yahoo! Inc.

Topic modeling captures the semantic relations among words, which should be helpful for information retrieval tasks. We present probability mixture modeling and term modeling methods to integrate topic models into the language modeling framework for information retrieval. A variety of topic modeling techniques have been proposed or introduced into IR tasks, including manually-built query models, term similarity measures, and latent mixture models, especially Latent Dirichlet Allocation (LDA), a formal generative latent mixture model of documents. We investigate and evaluate them on several TREC collections within the presented frameworks and show that significant improvements over previous work can be obtained. Practical issues such as efficiency and scaling are discussed and compared for different topic models. Other recent topic modeling techniques are also discussed.
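
A probability-mixture sketch in this spirit, with invented toy topic estimates standing in for quantities an LDA model would learn:

```python
import math

# Toy LDA-style estimates (hypothetical numbers): per-document topic
# mixtures THETA[d][z] and per-topic word distributions PHI[z][w].
THETA = {"d1": [0.9, 0.1], "d2": [0.1, 0.9]}
PHI = [{"election": 0.3, "vote": 0.3, "goal": 0.01},
       {"election": 0.01, "vote": 0.01, "goal": 0.4}]

def lda_lm_score(query, doc, doc_text, mu=0.7):
    """Probability-mixture retrieval sketch: interpolate the maximum-
    likelihood document model with a topic-model-based model, so a
    document can match query words it never contains but its topics
    favor."""
    n = len(doc_text)
    logp = 0.0
    for w in query:
        p_ml = doc_text.count(w) / n
        p_lda = sum(THETA[doc][z] * PHI[z].get(w, 1e-6)
                    for z in range(len(PHI)))
        logp += math.log(mu * p_ml + (1 - mu) * p_lda)
    return logp

d1_text = "the election results were announced".split()
d2_text = "a late goal decided the match".split()
q = ["vote"]   # appears in neither document
print(lda_lm_score(q, "d1", d1_text) > lda_lm_score(q, "d2", d2_text))
```

The query word occurs in neither document, so a pure unigram model cannot separate them; the topic component does.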

Jerod Weinman: Unified Detection and Recognition for Reading Text in Scene Images (Allen Hanson and Erik Learned-Miller, Advisors); May 2008; Assistant Professor, Department of Computer Science, Grinnell College.

Computers can currently "read" document text for the blind about as well as a seven-year-old. Scene text recognition brings new challenges. A central limitation of current approaches is a feed-forward, bottom-up, pipelined architecture that isolates the many tasks and information involved in reading. The result is a system that commits irreversible errors and has components that lack access to relevant information.

We propose a system for scene text reading that in its design, training, and operation is more integrated, which could lead to improved performance. First, we present a simple contextual model for text detection that is ignorant of any recognition. Through the use of special features and data context, this model performs well on the detection task, but limitations remain due to the lack of interpretation. We then introduce a recognition model that integrates several information sources, including font consistency and a lexicon, and compare it to pipelined approaches. Next we examine a unified framework where features are selected for the joint detection and recognition task, yielding better results with fewer features. Finally, we demonstrate a model that integrates segmentation and recognition of both words and characters to recognize text with difficult layouts and low resolution more accurately.

Xiaolan Zhang: Routing in DTN: Performance Modeling, Network Coding Benefit and Real Trace Studies (James Kurose and Donald Towsley, Advisors); Sept. 2007; Assistant Professor, Fordham University.

We study three problems in unicast routing for Disruption-Tolerant Networks (DTNs). First, we propose a unified framework based on Ordinary Differential Equations (ODEs) to study the performance of a class of epidemic-style routing schemes. Derived as the limit of a Markov process model under proper scaling, the ODE models capture the propagation and recovery process of data packets. We derive a rich set of closed-form results using the ODE models and quantitatively characterize their performance.
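
The simplest member of this ODE family is the logistic spreading equation; a sketch with illustrative parameters comparing numerical integration to its closed-form solution (the thesis's models additionally cover recovery and other scheme variants):

```python
import math

def epidemic_ode(N=100, beta=0.01, t_end=5.0, dt=0.001):
    """Euler-integrate the fluid (ODE) model of epidemic spreading,
    dI/dt = beta * I * (N - I): I(t) of N nodes carry a copy of the
    packet, and each pairwise contact (rate beta) spreads it."""
    I, t = 1.0, 0.0
    while t < t_end:
        I += beta * I * (N - I) * dt
        t += dt
    return I

def epidemic_closed_form(t, N=100, beta=0.01, I0=1.0):
    # Logistic solution of the same ODE; closed forms like this are
    # what make the fluid models analytically convenient.
    return N * I0 / (I0 + (N - I0) * math.exp(-beta * N * t))

numeric = epidemic_ode()
exact = epidemic_closed_form(5.0)
print(numeric, exact)   # the two agree closely
```

Quantities such as expected delivery delay then follow from when the destination's contact process first samples an infected node, which is where the closed-form results pay off.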

Next, we investigate the benefit of applying Random Linear Coding (RLC) in resource-constrained DTNs. We explore different ways to apply network coding and study cases with different network traffic load. We show that the RLC-based scheme achieves minimal block delivery delay with high probability and improves the tradeoff between block delivery delay and number of transmissions.

Last, we study mobility traces taken from UMass DieselNet. We analyze the bus-to-bus contact traces to develop a generative model and then generate synthetic traces. We show that an aggregate model for the inter-contact time is too coarse to accurately capture DTN routing performance. We construct a route-level inter-contact time model, which captures interesting mobility structure within the trace.

Yun Zhou: Retrieval Performance Prediction and Document Quality (W. Bruce Croft, Advisor); Feb. 2008; Software Engineer, Google, Inc.

The ability to predict retrieval performance has potential applications in many important IR (Information Retrieval) areas. In this thesis, we study the problem of predicting retrieval quality at the granularity of both the retrieved document set as a whole and individual retrieved documents. At the level of ranked lists, we propose several novel prediction models that capture aspects of the retrieval process with a major impact on retrieval effectiveness. These techniques make performance prediction both effective and efficient in various retrieval settings, including a Web search environment. As an application, we also provide a framework for the problem of query expansion prediction. At the level of individual documents, we predict document quality in the context of Web ad hoc retrieval. We explore document features that are predictive of quality and propose a document quality language model that improves retrieval effectiveness by incorporating quality information.
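
A classic ranked-list predictor in this family, though not one of the thesis's own models, is the clarity score; a sketch with invented toy data:

```python
import math
from collections import Counter

def clarity(retrieved_docs, collection, lam=0.5):
    """Clarity-score sketch: KL divergence, summed over the words of
    the retrieved set, between a smoothed query model built from the
    top-ranked documents and the collection model. A large value
    suggests a focused result list and likely-effective retrieval."""
    coll_counts = Counter(collection)
    coll_len = len(collection)
    ret = [w for d in retrieved_docs for w in d]
    kl = 0.0
    for w, c in Counter(ret).items():
        p_c = coll_counts[w] / coll_len
        p_q = lam * c / len(ret) + (1 - lam) * p_c
        kl += p_q * math.log(p_q / p_c)
    return kl

coll = ("the market rose sharply today " * 3
        + "the cat sat on the mat " * 3).split()
focused = [["market", "rose", "sharply"], ["market", "rose", "today"]]
diffuse = [["the", "cat", "market"], ["the", "mat", "today"]]
print(clarity(focused, coll) > clarity(diffuse, coll))  # True
```

Predictors of this kind need nothing beyond the ranked list and collection statistics, which is what makes performance prediction cheap enough to run on every query.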