Dilemmas of algorithm design in the big data era

October 15, 2015

In an interview with Barna Saha, CICS assistant professor, she described some of the research challenges and latest advances in the area of algorithm design and analysis.

The amount of data in our world has been exploding at an unforeseeable rate. The increasing volume and detail of information captured by enterprises, and the rise of multimedia, social media, and the Internet of Things are contributing to this exponential growth. Healthcare industries are promising to transform the world through "big" data. The ongoing data deluge is bringing in new opportunities in businesses, finances, and education. But with big promises, come big challenges. "If we have to bear the fruit of big data, we need to analyze this large amount of data in a timely manner and extract relevant information from it," said Saha. "We need algorithms that run fast, are robust to changes in data, and understand the trade-off between efficiency and accuracy."

From its inception, complexity theory through concepts such as NP-Hardness has classified computational problems into those that have relatively efficient solutions versus those that are intractable. Any problem that can be solved in time polynomial in the input size falls in the first category. "Let us consider the Facebook graph where each user represents a vertex in the graph and there is an edge between two vertices if the corresponding users are friends," said Saha. In 2014, Facebook had more than 1.32 billion users, approximately 1/7th of the world population. A simple query of finding all-pairs shortest distance requires cubic time-complexity in the number of vertices, i.e., more than the total time since the inception of the earth. Clearly this is not efficient. "We need a finer-grained classification of problems categorized as 'efficient' under the traditional complexity theory," noted Saha. "We need to design faster algorithms, or to understand why such algorithms do not exist."

"Can we improve the cubic running time for finding all-pairs shortest paths in general graphs?" questioned Saha. There is no concrete answer to this question. But a recent result shows that if the edges have arbitrary real weights, then finding a truly subcubic algorithm, one that runs in n(3-d) (n is the number of vertices, d >0) time will be a breakthrough. This will lead to better running times for a large number of computational problems from disparate domains for which no progress has been made for decades. On the other hand, faster algorithms are known when the graph is sparse, or if we allow a small "approximation." Exploiting the structural properties of the underlying instance, and allowing approximations, are two key ingredients for designing fast algorithms. Approximation introduces inaccuracies. If the true distance between some pair of nodes is "x," an approximation algorithm may return results within x(1 +- e) where the goal is to make e as small as possible. The smaller the e, the better the quality of solution and the slower the algorithm. "When time is of essence, a fast approximation algorithm can be of greater value than a slow exact algorithm," added Saha.

While all-pairs shortest paths compute distances among vertices in a graph, edit distance is a fundamental distance measure between sequences. It computes the number of edits (e.g. insertion, deletion, substitution) to convert one string to another, and has wide applications in computational biology, natural language processing, and so forth. A recent article in MIT Review (http://news.mit.edu/2015/algorithm-genome-best-possible-0610) reports on a new result on the complexity of finding edit distance between two sequences. For decades researchers have tried to design subquadratic algorithms for edit distance computation; this result shows such algorithms will refute "the strongly exponential time hypothesis" (SETH), a conjecture that is becoming a main device for proving conditional hardness of problems. A broader problem with an overwhelming number of applications arises when we allow matching a sequence to a class of sequences, instead of just a single sequence. Consider a class of documents, and we want to know whether a specific document matches the semantics of this class, and if not what minimal changes are required to match it with any one of them. Often the valid class of documents can be represented by a formal grammar. In that case, the problem of computing edit distance to any valid member of the grammar is known as the "language edit distance problem." Using dynamic programming, it is possible to compute exact language edit distance in cubic time. In a recent result, Saha shows this is nearly the best possible - connecting this problem to many fundamental graph problems such as all-pairs shortest paths and beyond. This provides a bridge between the complexity of distance computation over sequences to basic graph problems.

"Can we design faster approximation algorithms for the language edit distance problem beating the cubic hardness? In one of our works, we show indeed that is possible," said Saha. "Our algorithm also gives fast alternatives to many graph problems exploiting the connection. The main crucial ingredient in this algorithm is a method of 'dependent randomized rounding.' First, we show the language edit distance computation can be computed fast if for all substrings the true distance belongs to a small possible set of distances. If that is not the case, we 'round' the actual distances to belong to that set. We select the rounded value by following a carefully crafted probability distribution depending on the value of the actual solution, and we slowly build the solution to successively larger sequences like in dynamic programming. Indeed, our method can be useful to accelerate dynamic programming, one of the most fundamental methods for developing polynomial time algorithms. Exact dynamic programmings are often slow." In a recent project funded by NSF, Saha is striving to build new generic tools for scalable dynamic programmings, where approximation and randomization promise to play very important roles.

"In prior works, we have developed new paradigms of dependent randomized rounding which were useful to obtain polynomial time algorithms for many NP-Hard problems," said Saha. "Here, a similar concept helps us to design fast algorithm for a polynomial time problem." The area of understanding the fine-grained complexity of problems, and designing fast algorithms through that lens is still in its early-stage, and there are many open problems. In the big data era, when a quadratic running time is as bad as a NPHard problem in efficiency, the importance of fine-grained complexity is growing. As a fellow in the Simons Institute at the University of California Berkeley, Saha is participating in a semester-long program devoted to understanding the main challenges in this area, and making progress on a variety of fundamental questions in the theory of computation utilizing close connections between complexity theory and algorithm design.