RECENT COMPUTER SCIENCE PH.D. GRADUATES (FEBRUARY AND MAY 2015)

Md. Ashraful Alam; Reconstructing Geometric Structures from Combinatorial and Metric Information; Ileana Streinu, Advisor; Feb. 2015; Software Engineer, Intel Corporation

In this dissertation, we address three reconstruction problems. First, we address the problem of reconstructing a Delaunay triangulation from an outer planar graph. We present a new proof that any outer planar graph is Delaunay realizable.

Given a convex polyhedron P and a point s on the surface (the source), the ridge tree or cut locus is a collection of points with multiple shortest paths from s on the surface of P. If we cut the surface of P along the shortest paths from s to the vertices of P, we obtain a planar polygon called shortest path star (sp-star) unfolding. Given a combinatorial structure of a ridge tree, we show how to construct the ridge tree and the sp-star unfolding in which it lies. In this process, we address several problems concerning the existence of sp-star unfoldings on specified source point sets and present some positive and negative results.

Finally, we introduce and study a new variant of the sp-star unfolding, called (geodesic)star unfolding where we cut the surface of the convex polyhedron along a set of non-crossing geodesics (not-necessarily the shortest). We show how to construct a geodesic star unfolding from a special class of points and also consider the reconstruction problem : given a star unfolding of some convex polyhedron P and a source point s, how we can construct the sp-star unfolding of P with respect to s. We introduce a new operation, called cut and paste operation and show, using experimental data, that we can reconstruct the sp-star unfolding from a given geodesic star unfolding by applying this cut-paste operation.

Gordon Anderson; An Analysis of Two Student Learning Strategies in High Enrollment Computer-based College Courses; Robert Moll, Advisor; May 2015; Lecturer, Department of Computer Science, UMass Amherst

The subject of this thesis is an observational investigation of the merits of two commonly held beliefs about student learning strategies in two large, STEM (Science, Technology, Engineering, and Math) college courses: 1) students don't read their text books before working homework problems and don't learn as well as a result, and 2) students who do their work just before it's due don't learn as well because they are depriving themselves of the time necessary to fully consider the problems they have to solve. In order to assess the validity of these claims, we created features to measure the amount of textbook interaction and the amount of work submitted close to due dates. Our hypotheses were: 1) more interaction with the text before starting homework would result in higher exam scores, especially for novice students, and 2) the more work submitted close to a due date would result in lower than average exam scores. Our analysis showed a statistically significant (p<.05) effect of following a book-first strategy for students with no previous programming experience in the Computer Science course, as well as for the chemistry students. We were able to find a significant effect for one of the groups in our study, though students who submitting work within an hour of a due date had significantly lower exam scores on average than peers who did not work that late. We conclude that text interaction has a positive effect on outcomes for novice students, and that though many students work close to due dates, it does not seem to be as detrimental to outcomes as many believe.

Boulat Bash; Fundamental Limits of Covert Communication; Donald Towsley, Advisor; Feb. 2015; Postdoctoral Fellow, BBN Technologies

We present a square root limit on low probability of detection (LPD) communication over additive white Gaussian noise (AWGN) channels. Specifically, if a warden has an AWGN channel to the transmitter with non-zero noise power, we prove that $o(\sqrt{n})$ bits can be sent from the transmitter to the receiver in n AWGN channel uses with probability of detection by the warden less than $\epsilon$ for any $\epsilon>0$. Moreover, in most practical scenarios, the lower bound on the noise power on warden's channel to the transmitter is known and $O(\sqrt{n})$ bits can be covertly sent in n channel uses. Conversely, attempting to transmit more than $O(\sqrt{n})$ bits either results in detection by the warden with probability one or a non-zero probability of decoding error as n goes to infinity. Further, we show that LPD communication on the AWGN channel allows one to send a non-zero symbol on every channel use, in contrast to what might be expected from the square root law found recently in image-based steganography.

Bruno Castro da Silva; Learning Parameterized Skills; Andrew G. Barto, Advisor; Feb. 2015; Postdoctoral Associate, Laboratory for Information and Decision Systems, MIT

One of the defining characteristics of human intelligence is the ability to acquire and refine skills. Skills are behaviors for solving problems and performing tasks that an agent encounters often--sometimes in different contexts and situations--throughout its lifetime. Identifying important problems that recur and retaining their solutions as skills allows an agent to more rapidly solve novel problems by adjusting and combining its existing skills.

In this thesis we introduce a general framework for learning reusable parameterized skills. Reusable skills are parameterized procedures that--given a description of a problem to be solved or task to be performed--produce appropriate behaviors or policies. They can be sequentially and hierarchically combined with other skills and primitive actions to produce progressively more abstract and temporally extended behaviors.

We identify three major challenges involved in the construction of such skills. First, an agent should be capable of solving a small number of problems and generalizing these experiences to construct a single reusable skill. The skill should be capable of producing appropriate behaviors even when applied to yet unseen variations of a problem. We introduce a method for estimating properties of the lower-dimensional manifold on which problem solutions lie. This allows for the construction of unified models capable of predicting policy parameters from task parameters.

Secondly, an agent should be able to identify when a parameterized skill can be hierarchically decomposed into specialized sub-skills. We observe that the policy manifold may be composed of disjoint, piecewise-smooth charts, each one typically encoding solutions for a subclass of related problems. Identifying and modeling sub-skills allows for the autonomous aggregation of related parameterized behaviors into single, more abstract skills.

Finally, the agent should be able to actively select on which problems it wishes to practice in order to more rapidly become competent in a skill. Thoughtful and deliberate practice is one of the defining characteristics of human expert performance. By carefully choosing on which problems to practice the agent may be able to more rapidly construct a skill that performs well over a wide range of related problems.

We address these challenges by introducing a general framework for learning parameterized skills. We evaluate it on challenging simulated decision-making problems and on a physical humanoid robot, and we demonstrate that it allows for the efficient and active construction of reusable skills from limited data.

Yung-Chih Chen; Robust Mobile Data Transport: Modeling, Measurements, And Implementation; Donald Towsley, Advisor; May 2015; Senior Performance Engineer, Akamai Technologies

Advances in wireless technologies and the pervasive influence of multi-homed devices have significantly changed the way people use the Internet. These changes of user behavior and the evolution of multi-homing technologies have brought a huge impact to today's network study and provided new opportunities to improve mobile data transport. In this thesis, we investigate challenges related to human mobility, with emphases on network performance at both system level and user level. More specifically, we seek to answer the following two questions: 1) How to model user mobility in the networks and use the model for network provisioning? 2) Is it possible to utilize network diversity to provide robust data transport in wireless environments?

We first study user mobility in a large scale wireless network. We propose a mixed queueing model of mobility and show that this model can accurately predict both system-level and user-level performance metrics. Furthermore, we demonstrate how this model can be used for network dimensioning.

Secondly, with the increasing demand of multi-homed devices that interact with heterogeneous networks such as WiFi and cellular 3G/4G, we explore how to leverage this path diversity to assist data transport. We investigate the technique of multi-path TCP (MPTCP) and evaluate how MPTCP performs in the wild through extensive measurements in various wireless environments using WiFi and cellular 3G/4G simultaneously. We study the download latencies of MPTCP when using different congestion controllers and number of paths under various traffic loads and over different cellular carriers.

We further study the impact of short flows on MPTCP by modeling MPTCP's delay startup mechanism of additional flows. As flow sizes increase, we observe that traffic in cellular networks exhibits large and varying packet round trip times, called bufferbloat. We analyze the phenomenon of bufferbloat, and illustrate how bufferbloat can result in numerous MPTCP performance issues. We provide an effective solution to mitigate the performance degradation.

Finally, as popular content is replicated at multiple locations, we develop mechanisms that take advantage of this source diversity along with path diversity to provide robust mobile data transport. We demonstrate this in the context of online video streaming, because of its popularity and significant contribution to Internet traffic. We therefore propose MSPlayer, a client-based solution for online video streaming that adjusts network traffic distribution over each path to network dynamics. MSPlayer bypasses the deployment limitations of MPTCP while maintaining the benefits of path diversity, and exploits different content sources simultaneously. MSPlayer can significantly reduce video start-up latency and quickly refill playout buffer for high quality video streaming. We evaluate MSPlayer's performance through YouTube.

Stefan Christov; Model-Based Guidance for Human-Intensive Processes; Lori A. Clarke, George Avrunin, Advisors; Feb. 2015; Assistant Professor of Software Engineering, Quinnipiac University

Human-intensive processes (HIPs), such as medical processes involving coordination among doctors, nurses, and other medical staff, often play a critical role in society. We use the word "process" to refer to the coordination of activities to achieve a task or a goal, where the activities may be performed by humans, devices, or software systems. We say that a process is "human-intensive" if the contributions of human process performers have a significant impact on the process outcomes and require substantial domain expertise and insight. Despite considerable work and progress in error reduction, human errors are still a major concern for many HIPs. For example, human errors in medical HIPs constitute a large part of the preventable medical errors estimated to cause the death of between 98,000 and 400,000 people each year in the U.S.

To address this problem of human errors in HIPs, this thesis investigates two approaches for online process guidance, i.e., for guiding process performers while a process is being executed. Both approaches rely on monitoring a process execution and base the guidance they provide on a detailed formal process model that captures the recommended ways to perform the corresponding HIP. The first approach, which we call deviation detection and explanation, automatically detects when an executing HIP deviates from a set of recommended executions of that HIP, as specified by the process model. Such deviations could represent errors and, thus, detecting and reporting deviations as they occur could help catch errors before something bad happens. The approach also provides information to help explain a detected deviation to assist process performers with identifying potential errors and with planning recovery from these errors. The second approach, which we call process state visualization, proactively guides process performers by showing them information relevant to the current process execution, such as the activities that need to be performed at each point of that process execution. The goal of the process state visualization approach is to reduce the number of human errors.

The success of the online process guidance approaches presented in this work depends on the correctness and the degree of completeness of the underlying process model. If the process model is incorrect with respect to some set of requirements or it is incomplete by not representing all the process executions that domain experts have agreed should be captured in the model, then the online guidance may be incorrect or there might not be enough information in the process model to provide guidance in certain scenarios. Given the complexity of some HIPs, creating a correct and sufficiently complete process model is challenging. This thesis investigates process elicitation techniques to help create such process models and discusses an evaluation of these techniques based on their application to real-world HIPs.

The major contributions of this work can be summarized as follows:

Compared the relative strengths and weaknesses of several techniques for process elicitation and process model validation to help create correct and sufficiently complete process models needed for the proposed online process guidance approaches.
Developed an approach for deviation detection and explanation and evaluated it with realistic process models and synthetic process executions with seeded errors.
Recognized delayed deviation detection as a potential obstacle for the approach and investigated its frequency and consequences.
Developed an initial approach for visualization of process execution state and demonstrated it on a medical case study.

Boduo Li; A Platform for Scalable Low-Latency Analytics using MapReduce; Yanlei Diao, Advisor; May 2015; Researcher, NEC Laboratories America

MapReduce has gained tremendous popularity as a scalable, easy-to-use parallel programming model for large-scale data analytics. However, MapReduce is tailored for batch processing: (1) Data needs to be loaded to the cluster before any queries can be run, resulting in a high delay to start query processing. (2) Answers to a long-running query are returned only when the entire job completes, causing a long delay in returning query answers. Such delays are undesirable for the arising needs to process not only "big data" but also "fast data", such as interactive analysis where users are waiting online for query answers over stored large datasets, or near real-time monitoring and decision-making problems over high-volume continuous data streams. To respond to these needs for low-latency analytics, a variety of new systems have been developed. However, they have abandoned the MapReduce implementation, as well as many benefits of MapReduce. Moreover, recent real-world case studies from industry show that maintaining multiple systems for data analytics results in serious issues, such as a great deal of code duplication, complexity in coordinating multiple processes, and manual partitioning of the workloads to determine which data is processed by which system. These issues point to the need for a general, unified data processing framework to support analytical queries with different latency requirements.

Towards the goal of building a unified data analytics system, this dissertation presents the design, implementation, and evaluation of a platform for scalable low-latency analytics based on MapReduce. The proposed platform fundamentally transforms the existing parallel processing paradigm in MapReduce into an incremental parallel processing paradigm, which is the key to remove the barrier in the existing MapReduce implementation to generating query answers with low latency. With further necessary architectural changes, model-based configuration optimization, and latency-aware runtime scheduling, the proposed platform is enabled to take streaming input, and targets at maximizing the utility given user-defined constraints.

The first part of the dissertation explores the eligibility of the existing MapReduce implementations for low-latency analytics. Our extensive benchmark study of Hadoop-based MapReduce systems shows that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental computation, which is required by low-latency analytics. We further proposed a MapReduce cost model, and optimize the system configuration based on the model. The benchmark results over the optimized system verify that the barrier to incremental computation is intrinsic, and cannot be removed by tuning system parameters.

In the second part of the dissertation, we employ various purely hash-based techniques to enable fast in-memory incremental processing in MapReduce, and frequent key based techniques to extend such processing to workloads that require memory more than available. We evaluate our Hadoop-based prototype equipped with all proposed techniques. The resulting system is able to perform batch processing with significantly shorter running time due to improved CPU and I/O cost, and output early results during a job due to incremental processing.

In the third part of the dissertation, we perform a thorough analysis to understand the reasons that can still cause high latency in our incremental processing platform. We then propose a number of necessary architectural changes in MapReduce, and augment it with new latency-aware model-driven configuration and latency-aware runtime scheduling techniques to meet user-specified latency requirements while maximizing throughput. Experiments using real-world workloads show that our techniques reduce the latency from tens or hundreds of seconds in Hadoop and our previous Hadoop-based prototype to sub-second in our new system, with 2x-7x increase in throughput. Our system also outperforms Storm, a commercial-grade distributed stream system, and Spark Streaming, a state-of-the-art academic prototype, by a wide margin.

Jason Naradowsky; Learning with Joint Inference and Latent Linguistic Structure in Graphical Models; David A. Smith, Advisor; Feb. 2015; Research Associate, Machine Reading Lab, University College London

A human listener, charged with the difficult task of mapping language to meaning, must infer a rich hierarchy of linguistic structures, beginning with an utterance and culminating in an understanding of what was spoken. Much in the same manner, developing complete natural language processing systems requires the processing of many different layers of linguistic information in order to solve complex tasks, like answering a query or translating a document.

Historically the community has largely adopted a "divide and conquer" strategy, choosing to split up complex tasks into subproblems which can be tackled independently, with the hope that these smaller contributions will also yield benefits to NLP systems as a whole. These individual components can be laid out in a pipeline and processed in turn, one system's output becoming input for the next. This approach poses two problems. First, errors propagate, and, much like the childhood game of "telephone", combining systems in this manner can lead to unintelligible outcomes. Second, each component task requires annotated training data to act as supervision for training the model. These annotations are often expensive and time-consuming to produce, may differ from each other in genre and style, and may not match the intended application.

In this dissertation we pursue novel extensions of joint inference techniques for natural language processing. We present a framework that offers a general method for constructing and performing inference using graphical model formulations of typical NLP problems. Models are composed using weighted Boolean logic constraints, inference is performed using belief propagation. The systems we develop are composed of two parts: one a representation of syntax, the other a desired end task (part-of-speech tagging, semantic role labeling, named entity recognition, or relation extraction). By modeling these problems jointly, both models are trained in a single, integrated process, with uncertainty propagated between them. This mitigates the accumulation of errors typical of pipelined approaches. We further advance previous methods for performing efficient inference on graphical model representations of combinatorial structure, like dependency syntax, extending it to various forms of phrase structure parsing.

Laura Sevilla Lara; Long Range Motion Estimation and Applications; Erik Learned-Miller, Advisor; Feb. 2015; Postdoctoral Researcher, Max-Planck Institute, Germany

Finding correspondences between images underlies many computer vision problems, such as optical flow, tracking, stereovision and alignment. Finding these correspondences involves formulating a matching function and optimizing it. This optimization process is often gradient descent, which avoids exhaustive search, but relies on the assumption of being in the basin of attraction of the right local minimum. This is often the case when the displacement is small, and current methods obtain very accurate results for small motions.

However, when the motion is large and the matching function is abrupt this assumption is less likely to be true. One traditional way of avoiding this abruptness is to smooth the matching function spatially by blurring the images. As the displacement becomes larger, the amount of blur required to smooth the matching function becomes also larger. This averaging of pixels leads to a loss of detail in the image. Therefore, there is a trade-off between the size of the objects that can be tracked and the displacement that can be captured.

In this thesis we address the basic problem of increasing the size of the basin of attraction in a matching function. We use an image descriptor called distribution fields (DFs). By blurring the images in DF space instead of in pixel space, we increase the size of the basin attraction with respect to traditional methods. We show competitive results using DFs both in object tracking and optical flow. Finally we demonstrate an application of capturing large motions for temporal video stitching.

Borislava Simidchieva; Variation in Human-Intensive Systems: a Conceptual Framework for Characterizing, Modeling, and Analyzing Families of Systems; Leon Osterweil, Advisor; May 2015; Scientist, Distributed Systems group, BBN Technologies

A system model---namely a formal definition of the coordination of people and software components performing activities, using resources and artifacts, and producing various outputs---can aid understanding of the real-world system it models. Complex real-world systems, however, exhibit considerable amounts of variation that can be difficult or impossible to represent within a single model. This dissertation evaluates the hypothesis that the careful characterization and representation of system variation can aid in the generation and analysis of concrete system instances related to one another in specified ways and manifesting different kinds of variation. When a set of closely related systems can be characterized by a compelling set-membership criterion, it is often useful and appropriate to characterize the set as a family of systems. In this dissertation, a variety of needs for system variation and for family criteria are identified. We focus on two specific kinds of variation, namely functional and agent variation, and suggest an approach for meeting these needs both at the level of requirements specification (problem-level variation), as well as at the level of implementation specification (solution-level variation).

We present a framework for generating and analyzing new system instances, using the Little-JIL process definition language as an experimental vehicle to study what process definition language capabilities are necessary to support the explicit modeling of variation at the solution level, and thereby to address needs at the problem level. We define a formal notation for specifying several different scenarios of functional and agent variation in human-intensive processes and describe a prototype system to accommodate this specification within an existing modeling framework. Once a family of systems is formally defined and characterized at the solution level, different analysis techniques can be applied to make assurances that all members of the family share certain kinds of properties. These analysis results can then be used to inform variation needs at the problem level. To evaluate the applicability of the approach, we study and model the variation observed in two real-world, human-intensive systems from the domains of conflict resolution and elections. Both case study domains have been observed to employ functional variants of their processes, and, given their complex coordination of human and software agents, both domains require agent variation, therefore fostering a fruitful application of our approach.

Michael Wick; Epistemological Databases for Probabilistic Knowledge Base Construction; Andrew McCallum, Advisor; Feb. 2015; Senior Member of Technical Staff, Oracle Labs East

Knowledge bases (KB) facilitate real world decision making by providing access to structured relational information which enables pattern discovery and semantic queries. Although there is a large amount of data available for populating KBs, this data must first be gathered and assembled. Traditionally, automatic methods of information extraction perform this integration by predicting the sets of entities and relations to be stored in the database. However, the resulting KB is often not reliable because (a) errors accumulate in the integration pipeline, and (b) they persist in the KB even after new information arrives that could rectify these errors.

We will address these concerns by studying a new paradigm for knowledge base construction that we term an "epistemological" database. In epistemological databases the existence and properties of entities are not directly input into the DB; they are instead determined by statistical inference on raw evidence input into the DB. This shift in thinking is important because it allows inference to revisit previous conclusions and retroactively correct errors as new evidence arrives. Evidence is abundant and in steady supply from spiders, semantic web ontologies, external databases, and even groups of enthusiastic human editors. As this evidence continues to accumulate and inference continues to run in the background, the quality of the knowledge base continues to improve. In this dissertation we will develop the machine learning components necessary to achieve epistemological knowledge base construction at scale by exploring both novel and existing techniques in modeling, inference, and learning.

Dan Xie; An Opportunistic Service Oriented Approach for Robot Search; Allen Hanson, Roderic Grupen, Advisors; Feb. 2015; Senior Software Engineer, Autodesk Inc.

Health care for the elderly poses a major challenge as the baby boom generation ages. Part of the solution is to develop technology using sensor networks and service robotics to increase the length of time that an elder can remain at home. Since moderate immobility and memory impairment are common as people age, a major problem for the elderly is locating and retrieving frequently used "common" objects such as keys, cellphones, books, etc. However, for robots to assist people while they search for objects, they must possess the ability to interact with the human client, complex client-side environments, and heterogeneous sensorimotor resources. Given this complexity, the traditional approach of developing particular control strategies in a top-down manner is not suitable.

In this dissertation an opportunistic service-oriented approach is presented to address the robot search problem in residential eldercare. With the presented approach, a hierarchy of search strategies is developed in a bottom-up manner from passive object detection and retrieval performed by embedded camera sensors to context-aware cooperative search performed by a human-robot team. By opportunistically employing available sensorimotor resources, the robotic application achieves increased search performance, and has the flexibility to balance between performance goals and resource constraints. To evaluate the proposed approach, I describe several experiments with a robot-sensor network that includes the UMass uBot-5, Pan-Tilt-Zoom cameras, and wireless sensors. The results of these experiments suggest that the robot search application based on the proposed approach can lead to efficient search performance and great flexibility in resource-constrained environments.

Limin Yao; Universal Schema for Knowledge Representation from Text and Structured Data; Andrew McCallum, Advisor; Feb. 2015; Software Engineer, Twitter Inc.

In data integration we transform information from a source into a target schema. A general problem in this task is loss of fidelity and coverage: the source expresses more knowledge than that can be fit into the target schema, or knowledge that is hard to fit into any schema at all. This problem is taken to an extreme in information extraction (IE) where the source is natural language---one of the most expressive forms of knowledge representation. To address this issue, one can either automatically learn a latent schema emergent in text (a brittle and ill-defined task), or manually define schemas. We propose instead to store data in a probabilistic representation of universal schema. This schema is simply the union of all source schemas, and we learn how to predict the cells of each source relation in this union. For example, we could store Freebase relations and relations that are expressed by natural language surface patterns. To populate such a database of universal schema, we present matrix factorization models that learn latent embedding vectors for entity tuples and relations.

We show that such latent models achieve substantially higher accuracy than a traditional classification approach on New York Times and Freebase data. Besides binary relations, we also use universal schema for unary relations, i.e., entity types. We also explore various facets of universal schema matrix factorization models on a large-scale web corpus, including implicature among the relations. We also evaluate our approach on the task of question answering using features obtained from universal schema, achieving state-of-the-art accuracy on a benchmark dataset.

RECENT COMPUTER SCIENCE PH.D. GRADUATES (FEBRUARY AND MAY 2015)

Subscribe to the CICS eNewsletter