We looked at the problem of processing large-scale archival web graphs and generating a simple representation of the raw input graph data. This representation should let the end user analyze and query the graph efficiently, and it should be flexible enough to be loaded into a database or processed with distributed computing frameworks such as Hadoop.
To achieve these goals, we developed a workflow for archival graph processing on Hadoop. The project is still ongoing, and its current status was presented at the LSDS-IR '11 workshop at the CIKM conference last month. I would like to share the paper [acm] [pdf] for those interested in further details. The source code for the graph construction algorithm is also open sourced on GitHub. The abstract follows:

In this paper, we study efficient ways to construct, represent and analyze large-scale archival web graphs. We first discuss details of the distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. While designing this representation, we consider both offline and online algorithms for graph analysis. Offline algorithms, such as PageRank, can use MapReduce and similar large-scale, distributed frameworks for computation. Online algorithms, on the other hand, can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or the Scalable Hyperlink Store by Najork) to perform the computations. Moreover, we also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline the essential API calls for a real-time access repository specific to archival web graphs. Finally, we conclude with a discussion of ideas for interesting archival web graph analysis that can lead us to discover novel patterns for designing state-of-the-art compression techniques.
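To give a flavor of what a MapReduce graph construction pass looks like, here is a minimal, illustrative sketch in Java. It is not the code from our repository: the input format (tab-separated sourceUrl, targetUrl, crawlTimestamp records) and all class and field names are assumptions made for illustration. The mapper emits each link keyed by its source page, and the reducer gathers a page's outlinks into a single adjacency-list line.

// Illustrative sketch only, not the paper's implementation.
// Assumes input lines of the form: sourceUrl \t targetUrl \t crawlTimestamp
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AdjacencyListBuilder {

  // Mapper: key each link record by its source URL, keep "target@timestamp" as the value.
  public static class LinkMapper extends Mapper<Object, Text, Text, Text> {
    private final Text source = new Text();
    private final Text target = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 3) {
        return; // skip malformed records
      }
      source.set(fields[0]);
      target.set(fields[1] + "@" + fields[2]);
      context.write(source, target);
    }
  }

  // Reducer: concatenate all outlinks of a page into one tab-separated adjacency line.
  public static class AdjacencyReducer extends Reducer<Text, Text, Text, Text> {
    private final Text adjacency = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text outlink : values) {
        if (sb.length() > 0) {
          sb.append('\t');
        }
        sb.append(outlink.toString());
      }
      adjacency.set(sb.toString());
      context.write(key, adjacency);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "archival adjacency list construction");
    job.setJarByClass(AdjacencyListBuilder.class);
    job.setMapperClass(LinkMapper.class);
    job.setReducerClass(AdjacencyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The resulting adjacency lists are just one possible flat output; the actual layered representation, update strategy, and repository API are described in the paper itself.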