Recent Computer Science PH.D. Graduates (September 2015)

Abhigyan; Content Placement as a Key to Infrastructure Service Design for a Content-Dominated, Highly Mobile Internet; Arun Venkataramani, Ramesh Sitaraman, Advisors; Sept. 2015; Researcher, AT&T Labs Research.

Most of the Internet traffic is content, and most of the Internet connected hosts are mobile. Our work focuses on the design of infrastructure services needed to support such a content-dominated, highly mobile Internet. In the design of these services, three sets of decisions arise frequently: (1) content placement for selecting the locations where a content is placed, (2) request redirection for selecting the location where a particular request is served from and (3) network routing for selecting the physical path between clients and the services they are accessing. Our central thesis is that content placement is a powerful factor, and is often more powerful than redirection and routing, in determining the cost, performance and energy-related metrics for these services. In support of this thesis, we consider three types of infrastructures discussed below.

Internet service provider (ISP): In an ISP carrying content-dominated traffic, we show that combinations of simple placement and routing schemes are effective in optimizing an ISP's performance and cost objectives. Further, we show that effective content placement contributes more than optimizing network routing to achieve an ISP's objectives. Our findings question the value of traditional ISP traffic engineering schemes that optimize routing alone, while simplifying the task of traffic engineering for the operators.

Global name service (GNS): We design and implement the Auspice global name service, which resolves names to network addresses for highly mobile entities, thereby providing a key building block for establishing and maintaining communication between mobile entities in the Internet. A key distinction between Auspice and other name services is Auspice's demand-aware replica placement engine that intelligently places name records to provide low lookup latency, low update cost, and high availability. In our experiments, Auspice's placement scheme enables it to significantly outperform commercial managed DNS providers, DHT-based replication as well as static placement schemes.

Content datacenter (CDC): Content datacenters cache and serve content to improve user-perceived performance for content accesses. In a CDC, we quantify the tradeoff between energy savings via consolidation and the user-perceived performance impact based on a real CDC workload. A key insight, supported via experiments, is that despite server consolidation, a simple placement scheme is able to achieve cache hit rates close to an unconsolidated datacenter, which helps mitigate the impact of consolidation on user-perceived latencies. Further, our work is the first to propose a network-aware server consolidation scheme that enables additional network energy savings over network-unaware server consolidation schemes for common datacenter topologies.

Hannah Blau; Automated Style Feedback for Advanced Beginner Java Programmers; Robert Moll, W. Richards Adrion, Advisors; Sept. 2015.

FrenchPress is an Eclipse plug-in that partially automates the task of giving students feedback on their Java programs. It is designed not for novices but for students taking their second or third Java course: students who know enough Java to write a working program but lack the judgment to recognize bad code when they see it. FrenchPress does not diagnose compile-time or runtime errors, or logical errors that produce incorrect output. It targets silent flaws, flaws the student is unable to identify for himself because nothing in the programming environment alerts him.
 
FrenchPress diagnoses flaws characteristic of programmers who have not yet assimilated the object-oriented idiom. Such shortcomings include misuse of the public modifier, fields that should have been local variables, and instance variables that should have been class constants. Other rules address the all too common misunderstanding of the boolean datatype. FrenchPress delivers explanatory messages in a vocabulary appropriate for advanced beginners. FrenchPress does not fix the problems it detects; the student must decide whether to change the program.

I conducted a classroom trial of FrenchPress covering four programming assignments in the Fall 2014 data structures and algorithms course. Among students whose code triggered one or more of the diagnostic rules, the percentage who modified their program in response to FrenchPress feedback varied from a high of 59% on the first assignment to a low of 23% on the second and fourth assignments. User satisfaction surveys indicate that among students who said FrenchPress gave them suggestions for improvement, the percentage who found the feedback helpful bounced from around 55% on the first and third assignments to 33% on the second and fourth assignments.

John Bowers; Skeleton Structures and Origami Design; Ileana Streinu, Advisor; Sept. 2015; Assistant Professor, Dept of Computer Science, James Madison University.

In this dissertation we study problems related to polygonal skeleton structures that have applications to computational origami. The two main structures studied are the straight skeleton of a simple polygon (and its generalizations to planar straight line graphs) and theuniversal molecule of a Lang polygon. This work builds on results completed jointly with my advisor Ileana Streinu.

Skeleton structures are used in many computational geometry algorithms. Examples include the medial axis, which has applications including shape analysis, optical character recognition, and surface reconstruction; and the Voronoi diagram, which has a wide array of applications including geographic information systems (GIS), point location data structures, motion planning, etc.

The straight skeleton, studied in this work, has applications in origami design, polygon interpolation, biomedical imaging, and terrain modeling, to name just a few. Though the straight skeleton has been well studied in the computational geometry literature for over 20 years, there still exists a significant gap between the fastest algorithms for constructing it and the known lower bounds.

One contribution of this thesis is an efficient algorithm for computing the straight skeleton of a polygon, polygon with holes, or a planar straight-line graph given a secondary structure called the induced motorcycle graph.

The universal molecule is a generalization of the straight skeleton to certain convex polygons that have a particular relationship to a metric tree. It is used in Robert Lang's seminal TreeMaker method for origami design. Informally, the univer- sal molecule is a subdivision of a polygon (or polygonal sheet of paper) that allows the polygon to be "folded" into a particular 3D shape with certain tree-like properties. One open problem is whether the universal molecule can be rigidly folded: given the initial flat state and a particular desired final "folded" state, is there a continuous motion between the two states that maintains the faces of the subdivision as rigid panels? A partial characterization is known: for a certain measure zero class of universal molecules there always exists such a folding motion. Another open problem is to remove the restriction of the universal molecule to convex polygons. This is of practical importance since the TreeMaker method sometimes fails to produce an output on valid input due the convexity restriction and extending the universal molecule to non-convex polygons would allow TreeMaker to work on all valid inputs. One further interesting problem is the development of faster algorithms for computing the universal molecule.

In this thesis we make the following contributions to the study of the universal molecule. We first characterize the tree-like family of surfaces that are foldable from universal molecules. In order to do this we define a new family of surfaces we call Lang surfaces and prove that a restricted class of these surfaces are equivalent to the universal molecules. Next, we develop and compare efficient implementations for computing the universal molecule. Next, by investigating properties of broader classes of Lang surfaces, we arrive at a generalization of the universal molecule from convex polygons in the plane to non-convex polygons in arbitrary flat surfaces. This is of both practical and theoretical interest. The practical interest is that this work removes the case from Lang's TreeMaker method that causes TreeMaker to fail to produce output in the presence of non-convex polygons. The theoretical interest comes from the fact that our generalization encompasses more than just those surfaces that can be cut out of a sheet of paper, and pertains to polygons that cannot be lied flat in the plane without self-intersections. Finally, we identify a large class of universal molecules that are not foldable by rigid folding motions. This makes progress towards a complete characterization of the foldability of the universal molecule.

Ethem Can; Exploiting Concepts in Videos for Video Event Detection; R. Manmatha, James Allan, Advisors; Sept. 2015.

Video event detection is the task of searching videos for events of interest to a user where an event is a complex activity occurring at a specific place and time. The video event detection problem has gained more importance as the amount of online video has been increasing by more than 300 hours every minute only on Youtube.

In this thesis, we tackle three major video event detection problems: video event detection with exemplars (VED-ex), where a large number of example videos are associated with queries; video event detection with few exemplars, in which only a small number of example videos are associated with queries; and zero- shot video event detection (VED-zero), where no exemplar videos are associated with queries.

We first define a new way of describing videos concisely, one that is built around using query-independent concepts (e.g., a fixed set of concepts for all queries) with a space-efficient representation. Using query-independent concepts enables us to learn a retrieval model for any query without requiring a new set of concepts. Our space-efficient representation helps reduce the amount of time required to train/test a retrieval model and amount of space to store video representations on disk.

When the number of example videos associated with queries decreases, the retrieval accuracy decreases as well. We present a method that incorporates multiple one-exemplar models into video event detection aiming at improving retrieval accuracies when there are few exemplars available. By incorporating multiple one-exemplar models into video event detection with few exemplars, we are able to obtain significant improvements in terms of mean average precision compared to the case of a monolithic model.

Having no exemplar videos associated with queries makes the video event detection problem more challenging as we cannot train a retrieval model using example videos. It is also more realistic since compiling a number of example videos might be costly. We tackle this problem by providing a new and effective zero-shot video event detection model that exploits dependencies of concepts in videos. Our dependency work uses a Markov Random Field (MRF) based retrieval model and assumes three dependency settings: 1) full independence, where each concept is considered independently; 2) spatial dependence, where the co-occurrence of two concepts in the same video frame is treated as important; and 3) temporal dependence, where having concepts co-occur in consecutive frames is treated as important. Our MRF based retrieval model improves retrieval accuracies significantly compared to the common bag-of-concepts approach with an independence assumption.

Thomas Helmuth; General Program Synthesis from Examples using Genetic Programming with Parent Selection Based on Random Lexicographic Orderings of Test Cases; Lee Spector, Advisor; Sept. 2015; Assistant Professor of Computer Science, Washington and Lee Universtiy.

Software developers routinely create tests before writing code, to ensure that their programs fulfill their requirements. Instead of having human programmers write the code to meet these tests, automatic program synthesis systems can create programs to meet specifications without human intervention, only requiring examples of desired behavior. In the long-term, we envision using genetic programming to synthesize large pieces of software. This dissertation takes steps toward this goal by investigating the ability of genetic programming to solve introductory computer science programming problems.

We present a suite of 29 benchmark problems intended to test general program synthesis systems, which we systematically selected from sources of introductory computer science programming problems. This suite is suitable for experiments with any program synthesis system driven by input/output examples. Unlike existing benchmarks that concentrate on constrained problem domains such as list manipulation, symbolic regression, or boolean functions, this suite contains general programming problems that require a range of programming constructs, such as multiple data types and data structures, control flow statements, and I/O. The problems encompass a range of difficulties and requirements as necessary to thoroughly assess the capabilities of a program synthesis system. Besides describing the specifications for each problem, we make recommendations for experimental protocols and statistical methods to use with the problems.

This dissertation's second contribution is an investigation of behavior-based parent selection in genetic programming, concentrating on a new method called lexicase selection. Most parent selection techniques aggregate errors from test cases to compute a single scalar fitness value; lexicase selection instead treats test cases separately, never comparing error values of different test cases. This property allows it to select parents that specialize on some test cases even if they perform poorly on others. We compare lexicase selection to other parent selection techniques on our benchmark suite, showing better performance for lexicase selection. After observing that lexicase selection increases exploration of the search space while also increasing exploitation of promising programs, we conduct a range of experiments to identify which characteristics of lexicase selection influence its utility.

Bo Jiang; Network Characteristics and Dynamics: Reciprocity, Competition and Information Dissemination; Donald Towsley, Advisor; Sept. 2015; Postdoctoral Research Associate, UMass Amherst.

Networks are commonly used to study complex systems. This often requires a good understanding of the structural characteristics and evolution dynamics of networks, and also their impacts on a variety of dynamic processes taking place on top of them. In this thesis, we study various aspects of networks characteristics and dynamics, with a focus on reciprocity, competition and information dissemination.

We first formulate the maximum reciprocity problem and study its use in the interpretation of reciprocity in real networks. We propose to interpret reciprocity based on its comparison with the maximum possible reciprocity for a network exhibiting the same degrees. We show that the maximum reciprocity problem is NP-hard,  and use an upper bound instead of the maximum. We find that this bound is surprisingly close to the empirical reciprocity in a wide range of real networks, and that there is a surprisingly strong linear relationship between the two. We also show that certain small suboptimal motifs called 3-paths are the major cause for suboptimality in real networks.

Secondly, we analyze competition dynamics under cumulative advantage, where accumulated resource promotes gathering even more resource. We characterize the tail distributions of duration and intensity for pairwise competition.  We show that duration always has a power-law tail irrespective of competitors' fitness, while intensity has either a power-law tail or an exponential tail depending on whether the competitors are equally fit. We observe a struggle-of-the-fitness phenomenon, where a slight different in fitness results in an extremely heavy tail of duration distribution.

Lastly, we study the efficiency of information dissemination in social networks with limited budget of attention. We quantify the efficiency of information dissemination for both cooperative and selfish user behaviors in various network topologies. We identify topologies where cooperation plays a critical role in efficient information propagation. We propose an incentive mechanism called "plus-one'' to coax users into cooperation in such cases.

Yoonheui Kim; Application of Techniques for Map Estimation to Distributed Constraint Optimization Problem; Victor Lesser, Advisor; Sept. 2015.

The problem of efficiently finding near-optimal decisions in multi-agent systems has become increasingly important because of the growing number of multi-agent applications with large numbers of agents operating in real-world environments. In these systems, agents are often subject to tight resource constraints and agents have only local views. When agents have non-global constraints, each of which is independent, the problem can be formalized as a distributed constraint optimization problem (DCOP). The DCOP is closely associated with the problem of inference on graphical models. Many approaches from inference literature have been adopted to solve DCOPs. We focus on the Max-Sum algorithm and the Action-GDL algorithm that are DCOP variants of the popular inference algorithm called the Max-Product algorithm and the Belief Propagation algorithm respectively. The Max-Sum algorithm and the Action-GDL algorithm are well-suited for multi-agent systems because it is distributed by nature and requires less communication than most DCOP algorithms. However, the resource requirements of these algorithms are still high for some multi-agent domains and various aspects of the algorithms have not been well studied for use in general multi-agent settings.

This thesis is concerned with a variety of issues of applying the Max-Sum algorithms and the Action-GDL algorithm to general multi-agent settings. We develop a hybrid algorithm of ADOPT and Action-GDL in order to overcome the communication complexity of DCOPs. Secondly, we extend the Max-Sum algorithm to operate more efficiently in more general multi-agent settings in which computational complexity is
high. We provide an algorithm that has a lower expected computational complexity for DCOPs even with n-ary constraints. Finally, In most DCOP literature, a one-to-one mapping between a variable and an agent is
assumed. However, in real applications, many-to-one mappings are prevalent and can also be beneficial in terms of communication and hardware cost in situations where agents are acting as independent computing units. We consider how to exploit such mapping in order to increase efficiency.

Chia-Jung Lee; Exploiting Social Media Sources for Search, Fusion and Evaluation; W. Bruce Croft, Advisor; Sept. 2015.

The web contains heterogeneous information that is generated with different characteristics and is presented via different media. Social media, as one of the largest content carriers, has generated information from millions of users worldwide, creating material rapidly in all types of forms such as comments, images, tags, videos and ratings, etc. In social applications, the formation of online communities contributes to conversations of substantially broader aspects, as well as unfiltered opinions about subjects that are rarely covered in public media. Information accrued on social platforms, therefore, presents a unique opportunity to augment web sources such as Wikipedia or news pages, which are usually characterized as being more formal. The goal of this dissertation is to investigate in depth how social data can be exploited and applied in the context of three fundamental information retrieval (IR) tasks: search, fusion, and evaluation.

Improving search performance has consistently been a major focus in the IR community. Given the in-depth discussions and active interactions contained in social media, we present approaches to incorporating this type of data to improve search on general web corpora. In particular, we propose two graph-based frameworks, social anchor and information network, to associate related web and social content, where information sources of diverse characteristics can be used to complement each other in a unified manner. The experimental results show that augmenting document representations with social signals can significantly outperform a wide range of baselines, including performing pseudo-relevance feedback on social collections.

Presenting social media content to users is valuable particularly for queries intended for time-sensitive events or community opinions. Current major search engines commonly blend results from different search services (or verticals) into core web results. Motivated by this real-world need, we explores ways to merge results from different web and social services into a single ranked list. We present an optimization framework for fusion, where the impact of documents, ranked lists, and verticals can be modeled simultaneously to maximize performance. Our results demonstrate that modeling all types of impact together achieves the best effectiveness.

Evaluating search system performance has largely relied on creating reusable test collections in IR. While traditional evaluation can require substantial manual effort, we explore an approach to automating the process of collecting pairs of queries and relevance judgments, using the high quality social media, Community Question Answering (CQA). Our approach is based on the idea that CQA services support platforms for users to raise questions and to share answers, therefore encoding the associations between real user information needs and real user assessments. We conduct experiments to study and verify the reliability of these new, CQA-based evaluation test sets.

Mathew Vimal; Energy-Efficient Content Delivery Networks; Prashant Shenoy, Ramesh Sitaraman, Advisors; Sept. 2015; Self-Employed.

Internet-scale distributed systems such as content delivery networks (CDNs) operate hundreds of thousands of servers deployed in thousands of data center locations around the globe. Since the energy costs of operating such a large IT infrastructure are a significant fraction of the total operating costs, we argue for redesigning them to incorporate energy optimization as a first-order principle. We focus on CDNs and demonstrate techniques to save energy while meeting client-percieved service level agreements (SLAs) and minimizing impact on hardware reliability.

Servers deployed at individual data centers can be switched off at low load to save energy. We show that it is possible to save energy while providing client-perceived availability and limited impact on hardware reliability. We propose an optimal offline algorithm and an online algorithm to extract energy savings and evaluate them on real production workload traces. Our results show that it is possible to reduce the energy consumption of a CDN by 51% while ensuring a high level of availability and incurring an average of one on-off transition per server per day.

We propose a novel technique called cluster shutdown that switches off an entire cluster of servers, thus saving on both server and cooling power. We present an algorithm for cluster shutdown that is based on realistic power models for servers and cooling equipment and can be implemented as a part of the global load balancer of a CDN. We argue that cluster shutdown has intrinsic architectural advantages over server shutdown techniques in the CDN context, and show that it outperforms server shutdown in a wide range of operating regimes.

To reduce energy costs, we propose a demand-response technique that responds to pricing signals from a smart grid by deferring elastic load. We propose an optimal offline algorithm for demand response and evaluate it on production workloads from a commercial CDN using realistic electricity pricing models. We show that energy cost savings can be achieved with no increase in the bandwidth cost.

Aditya Mishra; Energy Optimizations for Smart Buildings and Smart Grids; Prashant Shenoy, Advisor; Sept. 2015; Assistant Professor, Seattle University.

Modern buildings are heavy power consumers. Nearly 75% of the total electricity consumed in the US is consumed in the residential and commercial buildings. This consumption is not evenly distributed over time. Typical consumption profile exhibits several peaks and troughs. The peakiness, in turn, dictates the electric grid's generation, transmission, distribution costs and associated carbon emissions. In this thesis, I discuss challenges involved in achieving the sustainability goals in buildings and electric grids. I investigate building and grid energy footprint optimization techniques to achieve the following goals: 1) making buildings energy efficient, 2) cutting building's electricity bills, 3) cutting utility's costs in electricity generation and distribution, 4) reducing carbon footprint, and 5) making the aggregate electricity consumption profile grid-friendly.

In this talk, I will start by briefly presenting SmartCap---a system to enable homes automatically flatten their demand profile by scheduling background loads (such as A/Cs, refrigerator)---and SmartCharge, an intelligent battery charging system that shifts a building's electricity consumption to low-cost off-peak periods, thereby cutting user electricity bills and shaving grid-wide peaks. Next, I will present PeakCharge, a system enabling large-scale distributed energy storage at buildings throughout the grid to flatten grid demand, while 1) maintaining end-user incentives for storage adoption at grid-scale, and 2) ensuring grid stability. Our evaluations show that total storage capacity required by PeakCharge to flatten grid demand is within 18% of the capacity required by a centralized system. After PeakCharge, I will talk about the efficacy of employing combinations of energy storage technologies at different levels of the grid's distribution hierarchy to cut utility's costs. Here, I will present an optimization framework for modeling the primary characteristics of various energy storage technologies and tradeoffs in placing them at different levels of the distribution hierarchy. Using consumption data from 5000 consumers, we show that by employing hybrid storage across the distribution hierarchy, utilities can reduce their daily distribution costs up to 12%. I will conclude my talk with a discussion of future research directions.

Brian Taylor; Informed Search for Learning Causal Structure; David Jensen, Advisor; Sept. 2015; Machine Learning Engineer, Amazon.com, Inc.

Over the past twenty-five years, a large number of algorithms have been developed to learn the structure of causal graphical models.  Many of these algorithms learn causal structures by analyzing the implications of observed conditional independence among variables that describe characteristics of the domain being analyzed.  They do so by applying inference rules, data analysis operations such as the conditional independence tests, each of which can eliminate large parts of the space of possible causal structures.  Results show that the sequence of inference rules used by PC, a widely applied algorithm for constraint-based learning of causal models, is effective but not optimal.

This is because algorithms such as PC ignore the probability of the outcomes of these inference rules.  We demonstrate how an alternative algorithm can reliably outperform PC by taking into account the probability of inference rule outcomes.  Specifically we show that an informed search that bases the order of causal inference on a prior probability distribution over the space of causal constraints can generate a flexible sequence of analysis that efficiently identifies the same results as PC.  This class of algorithms is able to outperform PC even under uniform or erroneous priors.

Philip Thomas; Safe Reinforcement Learning; Andrew Barto, Advisor; Sept. 2015; Postdoctoral Fellow, School of Computer Science, Carnegie Mellon University.

This thesis proposes and presents solutions to two new problems that fall within the broad scope of reinforcement learning (RL) research. The first problem, high confidence off-policy evaluation (HCOPE), requires an algorithm to use historical data from one or more behavior policies to compute a high confidence lower bound on the performance of an evaluation policy. This allows us to, for the first time, provide the user of any RL algorithm with confidence that a newly proposed policy (which has never actually been used) will perform well.

The second problem is to construct what we call a safe reinforcement learning algorithm---an algorithm that searches for new and improved policies, while ensuring that the probability that a "bad" policy is proposed is low. Importantly, the user of the RL algorithm may tune the meaning of "bad" (in terms of a desired performance baseline) and how low the probability of a bad policy being deployed should be, in order to capture the level of risk that is acceptable for the application at hand.

We show empirically that our solutions to these two critical problems require surprisingly little data, making them practical for real problems. While our methods allow us to, for the first time, produce convincing statistical guarantees about the performance of a policy without requiring its execution, the primary contribution of this thesis is not the methods that we propose. The primary contribution of this thesis is a compelling argument that these two problems, HCOPE and safe reinforcement learning, which at first may seem out of reach, are actually tractable. We hope that this will inspire researchers to propose their own methods, which improve upon our own, and that the development of increasingly data-efficient safe reinforcement learning algorithms will catalyze the widespread adoption of reinforcement learning algorithms for suitable real-world problems.

Sookhyun Yang; Forensic and Management Challenges in Wireless and Mobile Network Environments; James Kurose, Advisor; Sept. 2015; Senior Performance Engineer, Akamai Technology.

The Internet recently passed an historic inflection point, with the number of broadband wireless/mobile devices surpassing the number of wired PCs and servers connected to the Internet. Smartphones, laptops, tablets, machine-to-machine (M2M) devices, and other portable devices have penetrated our daily lives.  According to Cisco, by 2018, wired devices will account for only 39% of IP traffic, with the remaining traffic produced by wireless/mobile devices. This proliferation of wireless/mobile devices is profoundly changing many of the characteristics of network applications, protocols, and operation, and posing fundamental challenges to the Internet architecture. In light of this new trend, this thesis focuses on forensic and mobility-management challenges in wireless/mobile network environments.

The first half of this thesis addresses two network-forensic challenges that arise due to the broadcast nature of wireless communications. In the first network-forensic challenge, we develop a mechanism to detect anomalous forwarding behaviors such as packet dropping, and packet reordering, and to identify the source of forwarding-behavior attacks that can disrupt a wireless ad hoc network. Our mechanism employs witness nodes that can overhear transmissions made by nearby wireless network nodes. In the second challenge, we investigate a method for gathering network-based evidence, based on constraints imposed by current U.S. law, for remotely disambiguating a sender's network access type (wired versus wireless); such a technique could be used to determine that a sender is connected physically to a network inside a building. We discuss several factors that might affect our classification results and identify the scenarios in which residential network access type can be accurately determined.

The second half of this thesis takes a more global and network-level point of view on mobility management and delves into a clean-state approach to designing a future Internet architecture that considers mobility as a first-order property. Before discussing architectural design issues, we present a measurement and modeling study of user transitioning among points of attachment to today's Internet. These transitions could result from a user's physical mobility or a stationary "multi-homed" user's changing his/her devices or NICs. This research provides insights and implications regarding control-plane workload for a mobility-management architecture. Our measurement results to date show that users spend the majority of their time attached to a small number of networks, and that a surprisingly large number of users access two networks contemporaneously. In the last part of our thesis research, we design techniques for efficiently handling group mobility in the context of the MobilityFirst architecture; MobilityFirst uses flat, globally unique names, binding a flat name to its network location via a logically centralized name- and location-resolution server.  Using the empirical model from our measurement study as well as more abstract models of group mobility, we evaluate our group mobility management techniques.