Recent Computer Science Ph.D. graduates (September 2014)

John Altidor; Subtyping with Generics: A Unified Approach; Yannis Smaragdakis, Advisor; Sept. 2014; Researcher, BBN Technologies

Variance is a key programming language mechanism for writing reusable software. Variance concerns the interplay of parametric polymorphism (i.e., generics) and subtype (inclusion) polymorphism. Integrating parametric and subtype polymorphism while maintaining type safety is a difficult problem, and existing variance mechanisms enable greater subtyping but suffer from severe deficiencies. This dissertation aims to improve variance mechanisms in programming languages supporting parametric polymorphism. To address the shortcomings of current mechanisms, I developed type-safe languages that support both definition-site and use-site variance. I also developed software that analyzed large, standard Java libraries (e.g., Oracle's JDK 1.6) and computed metrics to measure the benefits of adding definition-site variance to Java, which supports only use-site variance. I found that 21-47% (depending on the library) of generic definitions are inferred to have a single variance, and that up to 100% of existing wildcard annotations are unnecessary and can be elided.

Sean Barker; Model-Driven Analytics of Energy Meter Data in Smart Homes; Prashant Shenoy, Advisor; Sept. 2014; Visiting Assistant Professor, Department of Computer Science, Bowdoin College

The proliferation of smart meter deployments has led to significant interest in analyzing home energy use as part of the emerging 'smart grid'.  Meter data is often difficult to analyze, however, owing to device aggregation and coarse measurement granularities.  In this thesis, I present an architecture for enabling meter-driven smart homes and consider four challenges within this domain: (1) low-overhead data collection and processing for many devices, (2) designing models characterizing the energy use of modern devices, (3) tracking the real-time behavior of known devices, and (4) automatic identification of unknown devices in the home.  I first characterize the basic device types present in today's homes and construct a set of models that accurately represents complex real-world devices.  I then leverage this modeling framework to track the behavior of specific devices in close to real-time.  Finally, I present a learning-driven technique to automatically identify unknown devices attached to smart outlets.

Michael Crouch; Streaming Algorithms via Reductions; Andrew McGregor, Advisor; Sept. 2014; Member of Technical Staff, Bell Labs, Ireland

In the streaming algorithms model of computation we must process data "in order" and without enough memory to remember the entire input. We study reductions between problems in the streaming model with an eye to using reductions as an algorithm design technique. Our contributions include:
* "Linear Transformation" reductions, which compose with existing linear sketch techniques. We use these for small-space algorithms for numeric measurements of distance-from-periodicity, finding the period of a numeric stream, and detecting cyclic shifts.
* The first streaming graph algorithms in the "sliding window" model, where we must consider only the most recent L elements for some fixed threshold L. We develop basic algorithms for connectivity and unweighted maximum matching, then develop a variety of other algorithms via reductions to these problems.
* A new reduction from maximum weighted matching to maximum unweighted matching. This reduction immediately yields improved approximation guarantees for maximum weighted matching in the semistreaming, sliding window, and MapReduce models, and extends to the more general problem of finding maximum independent sets in p-systems.
* Algorithms in a "stream-of-samples" model which exhibit clear sample vs. space tradeoffs. These algorithms are also inspired by examining reductions. We provide algorithms for calculating Fk frequency moments and graph connectivity.
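
The unweighted-matching building block mentioned above can be illustrated with the classic one-pass greedy algorithm; this is a textbook baseline sketch, not the dissertation's sliding-window or reduction-based algorithms:

```python
def greedy_streaming_matching(edge_stream):
    """One-pass greedy maximal matching over a stream of edges: keep an
    edge iff neither endpoint is already matched. Any maximal matching is
    a 2-approximation to the maximum matching, and this uses O(n) space
    instead of storing the whole stream."""
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

edges = [(1, 2), (2, 3), (3, 4), (5, 6), (4, 5)]
result = greedy_streaming_matching(edges)  # → [(1, 2), (3, 4), (5, 6)]
```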

William Dabney; Adaptive Step-Sizes for Reinforcement Learning; Andrew Barto, Advisor; Sept. 2014; Machine Learning Scientist, Amazon.com

The central theme motivating this dissertation is the desire to develop reinforcement learning algorithms that "just work" regardless of the domain in which they are applied. The largest impediment to this goal is the sensitivity of reinforcement learning algorithms to the step-size parameter used to rescale incremental updates. Adaptive step-size algorithms attempt to reduce this sensitivity, or eliminate the step-size parameter entirely, by automatically adjusting the step size throughout the learning process. Such algorithms provide an alternative to the standard "guess-and-check" approach to finding parameters, known as parameter tuning.
However, the problems with parameter tuning are currently masked by the way experiments are conducted and presented. In this dissertation we seek algorithms which perform well over a broad subset of reinforcement learning problems with minimal parameter tuning.  To accomplish this we begin by addressing the limitations of current empirical methods in reinforcement learning and propose improvements with benefits far outside the area of adaptive step-sizes.

In order to study adaptive step-sizes in reinforcement learning we show that the general form of the adaptive step-size problem is a combination of two dissociable problems (adaptive scalar step-size and update whitening). We then derive new parameter-free adaptive scalar step-size algorithms for the reinforcement learning algorithm Sarsa(lambda) and use our improved empirical methods to conduct a thorough experimental study of step-size algorithms in reinforcement learning. Our adaptive algorithms (VES and PARL2) both eliminate the need for a tunable step-size parameter and perform at least as well as Sarsa(lambda) with an optimized step-size value. We conclude by developing natural temporal difference algorithms that provide an approximate solution to the update whitening problem and improve performance over their non-natural counterparts.
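
As a toy illustration of the step-size sensitivity at issue (and not a sketch of Sarsa(lambda), VES, or PARL2 themselves), the incremental update these algorithms rescale has the form x <- x + alpha_t * (sample - x); a hand-tuned constant alpha can be contrasted with the parameter-free sample-average schedule alpha_t = 1/t:

```python
import random

def incremental_estimate(samples, step_size):
    """Apply the incremental update x <- x + alpha_t * (sample - x),
    where step_size(t) returns alpha for the t-th update."""
    x = 0.0
    for t, s in enumerate(samples, start=1):
        x += step_size(t) * (s - x)
    return x

random.seed(0)
samples = [random.gauss(5.0, 1.0) for _ in range(10000)]

fixed = incremental_estimate(samples, lambda t: 0.1)           # tuned constant
sample_avg = incremental_estimate(samples, lambda t: 1.0 / t)  # parameter-free
```

The 1/t schedule recovers the exact sample mean with no tuning, while a constant alpha forgets old data exponentially; the right choice depends on the problem, which is what makes automatic adaptation attractive.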

Junghee Jo; Defining, Evaluating, and Improving the Verification of Patient Identifiers During Medication Administration; Lori A. Clarke and  Jenna Marquard, Advisors; Sept. 2014; Senior Researcher, Electronics and Telecommunications Research Institute (ETRI)

Patient identification errors are a major cause of medication errors. During medication administration, failure to identify patients correctly can lead to patients receiving incorrect medications, perhaps resulting in adverse drug events and even death. Most medication error studies to date have focused on reporting patient misidentification statistics from case studies, on classifying types of patient identification errors, or on evaluating the impact of technology on the patient identification process, but few have proposed specific strategies or guidelines to decrease patient identification errors. This thesis aims to improve the verification of patient identifiers (VPI) process by making three key contributions to the patient identification literature.

First, to better understand the VPI process, we extended and formalized the requirements for VPI based on the Joint Commission's national patient safety guidelines. We showed the implications of these extended guidelines by applying them to artifacts typically used during medication administration (e.g., patient's statements about their identity, patient's identification band, medication label, and medication order). We found that nurses must choose from a considerable number of alternatives to comply with the extended guidelines. The alternatives vary depending on whether an artifact can be trusted prior to the start of the VPI process (from 16 to 8 for manual medication administration; from 8 to 1 for barcode medication administration), or what kind of information is encoded on the barcodes if the process involves barcode verification technology (from 3 to 1 when the ID band is initially considered a trusted artifact; from 8 to 3 when the ID band is not initially considered a trusted artifact).

Second, we evaluated whether nurses complied with the extended VPI guidelines when administering medications, using data from clinical simulations. Nurses' compliance with the extended guidelines was low under most conditions (2%-5% for manual medication administration; 12%-88% for barcode medication administration).

Third, we hypothesized that compliance would improve if healthcare workers were trained to follow a specific sequence of actions for VPI during medication administration (termed definitive procedure-based training), rather than receiving their current training. We evaluated nursing students' compliance with the extended VPI guidelines using clinical simulation, with each student completing a simple task (administering one medication) and a complex task (administering two medications). We found that students who received the definitive procedure-based training showed a significant increase in compliance on the simple task, but not on the complex task. Even among the complying students, few followed the specific sequence of actions detailed in the definitive procedure-based training.

Our findings suggest further study is needed to investigate more effective approaches for improving the VPI process, perhaps by better supporting individuals as they complete the process (e.g., appropriately designed technology) or by improving approaches to training.



Youngho Kim; Searching Based on Query Documents; W. Bruce Croft, Advisor; Sept. 2014; Postdoctoral Research Scientist, IBM Watson Research

Domain-specific searches can be very different from general web search. One unique characteristic in domain-specific searches is that many search tasks start with query documents. For example, in patent retrieval, one major search task is finding relevant information for new query patents. Another typical characteristic is that the search process can take longer and be more comprehensive, compared to web search. As an example, to complete a single patent retrieval task, a typical user may generate 15 queries and examine more than 100 retrieved documents. In these search environments, searchers need to formulate multiple queries based on query documents that are typically complex and difficult to understand.
In this work, we describe methods for automatically generating queries and diversifying search results based on query documents, which can be used for query suggestion and for improving the quality of retrieval results. In particular, we focus on resolving three main issues related to domain-specific searches: (1) query generation, (2) query suggestion and formulation, and (3) search result diversification. Automatic query generation helps users by reducing the burden of formulating queries from query documents. Using generated queries as suggestions is investigated as a method of presenting alternative queries. Search result diversification is important in domain-specific search because of the nature of the query documents. Since query documents generally contain long complex descriptions, diverse query topics can be identified, and a range of relevant documents can be found that are related to these diverse topics.

The methods we propose in this thesis explicitly address these three issues. To solve the query generation issue, we use binary decision trees to generate effective Boolean queries and label propagation to formulate more effective phrasal-concept queries. To diversify search results, we propose two different approaches: query-side and result-level diversification. To generate diverse queries, we identify important topics from query documents and generate queries based on the identified topics. For result-level diversification, we extract query topics from query documents and apply state-of-the-art diversification algorithms based on the extracted topics. In addition, we devise query suggestion techniques for each query generation method. To demonstrate the effectiveness of our approach, we conduct experiments in various search domains and devise appropriate evaluation measures for domain-specific search environments.
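
As a rough sketch of the label-propagation idea (the graph, seed labels, and majority-vote rule here are simplified assumptions for illustration, not the dissertation's formulation for phrasal-concept queries):

```python
def label_propagation(neighbors, seed_labels, iters=20):
    """Iteratively propagate seed labels over a graph: each non-seed node
    adopts the majority label among its currently labeled neighbors
    (ties broken alphabetically); seed nodes keep their labels."""
    labels = dict(seed_labels)
    for _ in range(iters):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seed_labels:
                updated[node] = seed_labels[node]
                continue
            votes = [labels[n] for n in nbrs if n in labels]
            if votes:
                updated[node] = max(sorted(set(votes)), key=votes.count)
        labels = updated
    return labels

# A chain of candidate terms a-b-c-d-e with two seed labels.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
         "d": ["c", "e"], "e": ["d"]}
labels = label_propagation(graph, {"a": "key", "e": "other"})
```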

Wentian Lu; Privacy-preserving Sanitization in Data Sharing; Gerome Miklau, Advisor; Sept. 2014; Software Engineer, Google Inc.

In the era of information explosion, data sharing is a common, everyday activity. When people request shared data in order to monitor, understand, and analyze it, privacy concerns are unsurprisingly one of the main barriers to such sharing, owing to the fear of disclosing sensitive information. Sharing data with privacy protection is thus an important and realistic problem.

In this dissertation, we consider the process of data sanitization, which disguises sensitive information before a dataset is shared. Our design goal for sanitization methods is always to protect privacy while preserving as much utility as possible, regardless of the data source and privacy requirements. In particular, we first propose a framework for sharing a database under retention policies for auditing purposes. Whereas obeying a retention policy often results in the wholesale destruction of the audit log in existing solutions, our framework allows data to expire at a finer granularity and supports audit queries over a database with incompleteness. Second, we solve the problem of untrusted system evaluation using shared database synthesis under differential privacy. Our synthetic database accurately preserves the core performance measures of a given query workload and satisfies differential privacy, with crucial extensions to multi-relation databases. Lastly, we consider modeling graphs under differential privacy and describe algorithms for sharing exponential random graph model estimates. Our solution decomposes the estimation problem into two steps: first computing private sufficient statistics, then estimating the model parameters. We show that our privacy mechanism provides provably less error than comparable methods and that our redesigned estimation algorithm offers better accuracy.
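
For concreteness, differential-privacy guarantees of this kind are typically built from primitives such as the Laplace mechanism; the standalone sketch below shows that standard primitive for a counting query, and is not the dissertation's synthesis or estimation algorithms:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via the inverse CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    """Release a count under epsilon-differential privacy using the Laplace
    mechanism: additive noise with scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(1)
noisy = private_count(1000, epsilon=0.5)  # true count perturbed by noise
```

Smaller epsilon means stronger privacy but larger noise; the noise is zero-mean, so averages of repeated releases concentrate around the true count.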

Marc Maier; Causal Discovery for Relational Domains: Representation, Reasoning, and Learning; David Jensen, Advisor; Sept. 2014; Data Scientist, MassMutual

Traditional causal discovery--the development of algorithms that learn causal structure from observational data--is limited to data describing a single type of entity.  However, most real-world domains are characterized by systems that involve complex interactions among multiple entity types.  We formalize a relational representation and model that expresses causal dependencies among the attributes of interacting, heterogeneous entities.  We introduce a new theory--relational d-separation--and a lifted representation--the abstract ground graph--that supports a sound, complete, and computationally efficient method for algorithmically deriving conditional independencies from probabilistic models of relational data.  The abstract ground graph representation also presents causal implications that enable the detection of causal direction for a single dependency without parametric assumptions.  We leverage these implications and the theoretical framework of relational d-separation to develop a sound and complete algorithm that learns causal structure from relational data.

Trek Palmer; Using Formal Methods to Verify Transactional Abstract Concurrency Control; J. Eliot B. Moss, Advisor; Sept. 2014; Software Engineer at NuoDB Inc. 

Transactional Memory is a technique for simplifying the task of implementing concurrent applications.  However, it is difficult to integrate transactional and non-transactional code, and the run-time's decisions can impact performance.  To remedy these shortcomings, there are several extensions to ordinary TM.  Each extension requires extra specification information for the run-time to determine how operations commute and how to 'undo' those operations.  It is crucially important that these specifications be correct, but reasoning about the commutativity of operations is often challenging.  A solution to the problem of specifying and verifying the concurrency properties of abstract data structures is the subject of this thesis.  We present a language, ACCLAM, for describing the abstract state of a data structure and reasoning about its concurrency control properties.  We also describe a tool to process ACCLAM into a machine-verifiable form.  We provide a semantic description of ACCLAM and some example models with results.

Abhinav Parate; Designing Efficient and Accurate Behavior-Aware Mobile Systems; Deepak Ganesan, Advisor; Sept. 2014; Researcher, HP Research Labs

The proliferation of sensors on smartphones, tablets, and wearables has led to a plethora of behavior classification algorithms designed to sense various aspects of an individual user's behavior, such as daily habits, activity, physiology, mobility, sleep, and emotional and social contexts. This ability to sense and understand the behaviors of mobile users will drive the next generation of mobile applications providing services based on users' behavioral patterns. In this thesis, we investigate ways in which we can enhance and utilize this behavioral understanding of users in mobile applications. In particular, we focus on identifying the key challenges in the following three aspects of behavior-aware applications: detection, analysis, and prediction of user behaviors; and we present systems and techniques developed to address these challenges.

In this thesis, we first demonstrate the utility of wrist-worn devices equipped with inertial sensors for real-time detection of health-related behaviors such as smoking and eating with high accuracy, recall, and precision. Our approach detects these behaviors passively, without any explicit user interaction, and does not require the use of any cumbersome device. Next, we design a context-query engine to support behavior analytics applications on mobile devices. Our proposed context-query engine performs information fusion across several user contexts for an individual user to enable optimizations such as i) energy-efficient continuous context sensing and ii) accurate context inference. Finally, we present a behavior prediction algorithm that predicts the phone user's app usage behavior and utilizes these predictions to improve the freshness of mobile applications, such as Facebook, that present users with the latest content fetched from remote servers. We show that our proposed algorithm delivers application content to the user that is, on average, fresh to within 3 minutes. An implementation of our algorithm is available as a widget in the Google Play Store that shows shortcuts for the predicted apps, and it has been downloaded and installed on more than 50,000 devices.

Jae Hyun Park; Retrieval Models based on Linguistic Features of Verbose Queries; W. Bruce Croft, Advisor; Sept. 2014; Software Engineer, Nuance Communications, Inc.

Natural language expressions are more familiar to users than keyword selection, and they let people represent sophisticated information needs. Instead of listing keywords, verbose queries are expressed as grammatically well-formed phrases or sentences in which terms are used together to represent the more specific meanings of a concept, and the relationships among these concepts are expressed by function words.

The goal of this thesis is to investigate methods of using the semantic and syntactic features of natural language queries to maximize the effectiveness of search. For this purpose, we propose the synchronous framework in which we use syntactic parsing techniques for modeling term dependencies. We use the Generative Relevance Hypothesis (GRH) to evaluate valid variations in dependence relationships between queries and documents. This is one of the first results demonstrating that dependency parsing can be used to improve retrieval effectiveness.

We propose a method for classifying concepts in verbose queries as key concepts and secondary concepts, which are used in a statistical translation model for query term expansion.  Key concepts are the most important terms of queries, and we use them as the context for translating terms. Although secondary concepts are not as important as key concepts, they still matter because they provide clues about what kind of information users are looking for. Based on the concept classification results, we elaborate a translation model in which terms are selectively translated according to the most important context of a given query or question.

We define the important new task of focused retrieval of answer passages, which aims to provide immediate answers to users' information needs while keeping the answer passage short enough for restricted search environments such as mobile devices and voice-based search systems.

Anand Seetharam; Efficient Routing and Scheduling in Wireless Networks; James F. Kurose, Advisor; Sept. 2014; Assistant Professor, California State University, Monterey Bay

The temporal and spatial variation in wireless channel conditions and node mobility make it challenging to design protocols for wireless networks. We design efficient routing and scheduling algorithms which adapt to changing network conditions to improve user-level performance. We analyze the performance of opportunistic and cooperative forwarding in static mesh networks, showing that opportunism outperforms cooperation. For mobile networks, we quantitatively analyze the tradeoff between state information collection and power consumption for a fixed source-to-destination goodput constraint. For heterogeneous networks, we propose an efficient greedy algorithm which dynamically classifies individual nodes as routers/flooders and achieves superior performance. Last, we consider an application-level wireless streaming scenario and design a greedy algorithm for efficiently scheduling multiple video streams from a base station to mobile clients so as to minimize the total number of application-playout stalls. We also develop models for coarse-timescale wireless channel variation to aid network- and application-layer protocol design.

Robert Walls; Inference-Based Forensics For Extracting Information From Diverse Sources; Brian Levine, Advisor; Sept. 2014;  Postdoctoral Researcher, Department of Computer Science, Pennsylvania State University

Digital forensics is tasked with the examination and extraction of evidence from a diverse set of devices and information sources. While digital forensics has long been synonymous with file recovery, this label no longer adequately describes the science's role in modern investigations. Spurred by evolving technologies and online crime, law enforcement is shifting the focus of digital forensics from its traditional role in the final stages of an investigation to assisting investigators in the earliest phases -- often before a suspect has been identified and a warrant served. Investigators need new forensic techniques to investigate online crimes, such as child pornography trafficking on peer-to-peer (p2p) networks, and to extract evidence from new information sources, such as mobile phones.

The traditional approach of developing tools tailored specifically to each source is no longer tenable given the diversity, volume of storage, and introduction rate of new devices and network applications. Instead, we propose the adoption of flexible, inference-based techniques to extract evidence from any format. Such techniques can be readily applied to a wide variety of different evidence sources without requiring significant manual work on the investigator's part. The primary contribution of my dissertation is a set of novel forensic techniques for extracting information from diverse data sources. We frame the evaluation using two different, but increasingly important, forensic scenarios: mobile phone triage and network-based investigations.
Via probabilistic descriptions of typical data structures, and using a classic dynamic programming algorithm, our phone triage techniques are able to identify user information in phones across varied models and manufacturers. We also show how to incorporate feedback from the investigator to improve the usability of extracted information.
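
The abstract names only a "classic dynamic programming algorithm"; Viterbi decoding over a probabilistic model is one such algorithm, sketched here on a toy record-tagging example whose states ("text"/"junk") and probabilities are hypothetical, not the dissertation's actual models:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Classic Viterbi DP: the most probable hidden-state sequence
    for an observation sequence under an HMM-style model."""
    # V[t][s] = (best probability of reaching state s at step t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = V[-1]
        V.append({s: max((prev[p][0] * trans_p[p][s] * emit_p[s][o], p)
                         for p in states)
                  for s in states})
    # Backtrack from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for layer in reversed(V[1:]):
        path.append(layer[path[-1]][1])
    return list(reversed(path))

# Toy model: bytes are either meaningful "text" or filler "junk".
states = ("text", "junk")
start = {"text": 0.5, "junk": 0.5}
trans = {"text": {"text": 0.8, "junk": 0.2},
         "junk": {"text": 0.2, "junk": 0.8}}
emit = {"text": {"a": 0.6, "0": 0.4},
        "junk": {"a": 0.1, "0": 0.9}}
tags = viterbi(["a", "a", "0", "0", "0"], states, start, trans, emit)
# → ['text', 'text', 'junk', 'junk', 'junk']
```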

For network-based investigations, we quantify and characterize the extent of contraband trafficking on peer-to-peer networks. We suggest various techniques for prioritizing law enforcement's limited resources. We finally investigate techniques that use system logs to generate and then analyze a finite state model of a protocol's implementation. The objective is to infer behavior that an investigator can leverage to further law enforcement objectives.

We evaluate all of our techniques using the real-world legal constraints and restrictions of investigators.

Xiaoxi Xu; Computational Communication Intelligence: Exploring Linguistic Manifestation and Social Dynamics in Online Communication; Beverly Woolf, Advisor; Sept. 2014; Data Scientist, Rovi Corporation

This dissertation aims to initiate a movement to improve people's communication capacities in online interactions. Drawing on research from deliberation, communication, multiple intelligences, neuroscience, and education, we define an ability-based model for the intellectual construct of communication that we call communication intelligence. The intellectual model of communication intelligence is composed of ten high-order communication skills, including self-reflection, perspective taking, and balance. We create hierarchical probabilistic models and multi-task learning formulations for modeling and measuring people's intelligence-embodied communication skills. We also present a multi-perspective analysis for understanding communication intelligence, including its diverse language, shared linguistic characteristics, social dynamics, and the effects of communication modality on communication intelligence.