Faculty Recruiting

Taming massive graphs

Pick up your favorite algorithms textbook, flip to random page, and there's a good chance that you'll find an algorithm for solving a problem about graphs. This shouldn't be surprising given that many interesting types of data are naturally represented as graphs. Examples include the transportation networks represented in Google Maps; social networks such as Facebook and Twitter; and protein-protein interaction networks in biology. However, implicit in the design of many of the existing algorithms is the assumption that the graphs of interest are static and are small enough to fit in the main memory of a single machine. Unfortunately, in many applications these assumptions are no longer reasonable.

"Many of the graphs that we need to process these days are massive and are constantly changing", says Professor Andrew McGregor. "For example, the web graph has over ten billion nodes and hyperlinks are constantly being added or removed. This necessitates new approaches to the design and analysis of algorithms for such graphs." He goes on to explain that algorithms whose running time is even quadratic in the size of the graph can often no longer be considered practical. Furthermore, algorithms may need to handle data that is distributed between numerous machines or is defined by data streams.

McGregor and his group have been investigating various aspects of the problem. One approach to dealing with a massive graph is to first compress the graph in a way that approximately preserves the relevant properties of the graph. The idea is analogous to using MP3 and JPEG files on your phone rather than the original songs and photos. While some of the information is lost, the hope is that the difference is imperceptible. For graphs, we need to be careful with what it means for the difference to be imperceptible and different notions (e.g., the distances between pairs of nodes or the size of cuts are approximately preserved) lead to different forms of compression. The benefits of compressing the graphs include the fact that running any existing algorithm on a smaller graph will be faster than running the algorithm on a large graph. The compressed graph may also fit in main memory thus avoiding I/O bottlenecks, and even if the graph data is stored across numerous machines, compressing the input reduces communication overheads.

However, this naturally raises the question of how to efficiently perform the compression. A popular technique for compressing some other types of data is to use random projections. For a simple example, consider representing a collection of text documents as vectors where the entries encode the frequency of the different words in the corresponding document. Then the similarity of two documents can be measured in terms of the distance between the corresponding vectors. The dimension of the vectors would be the size of the vocabulary which would be roughly hundreds of thousands in the case of English. However, by first applying random projections to the vectors, an existing result shows that it is possible to dramatically reduce their dimension without significantly perturbing the distance between any pair of vectors.

It initially seemed very unlikely that the random projections technique, also known as "sketching", would be of any use when it came to graph problems. In particular, the answer to a graph problem can be very sensitive to slight changes in the input and a single edge could make a big difference for even a simple problem like trying to test whether a graph was connected.  "We actually spent a considerable amount of time failing to prove impossibility results," admits McGregor "and I remember very clearly the moment I realized there was a good reason; because it was, in fact, possible!"

Their initial result still strikes McGregor as surprising. He illustrates it in the following way. Imagine we ask all the students at the university to determine whether the graph of their friendships is connected. A simple solution would be if every student listed all their friendships and we then studied these lists. Unfortunately, this could involve a lot of writing for the students that had many friends and someone who was friends with everyone would be there all day! The new result shows that even if each student knows nothing about the graph other than their immediate friends, she can summarize the list of her friends using a number of bits that is roughly logarithmic in the number of students. What makes this particularly surprising is that there could be a friendship between two students, say Alice and Bob, that is essential to the connectivity of the entire graph but neither Alice nor Bob would know this friendship was important when they summarize their lists. The solution devised by McGregor and his colleagues combines ideas from spectral graph theory and a new sampling technique developed in the context of data stream processing.

McGregor's research group have been working on extending their results to other classic graph problems. There also seem to be wider applications for the techniques they have developed including designing faster data structures for dynamic graphs; verification of outsourced graph processing; and new "sliding window" algorithms where the goal is to ensure that no outdated information corrupts the graph being processed. McGregor is organizing workshops in Japan and Canada next year to further explore connections between graph compression, graph streaming, and distributed graph processing.