Wayback is distributed with a .jar file that simplifies creation of large-scale CDX files using hadoop. This code is experimental, and will primarily be useful only if your CDX files are very large - more than a few hundred GB (the exact threshold depends on your hardware). If building or updating your CDX files is the largest problem with your installation, this may help. At IA, we've used this framework to build and deploy CDX files of more than 700GB, containing billions of records, on a 24 node cluster in about 8 hours from start to finish. Simply writing a 700GB file to disk at 50MB/sec takes around 4 hours, so the final deployment step alone accounts for around half that time.
Using hadoop to generate your CDX files requires the following high-level process:

1. Build a CDX file for each W/ARC file as it is ingested, and store those per-W/ARC CDX files in your HDFS.
2. Build a split file that tells hadoop how to partition the URL space across its sorted output chunks.
3. Build a manifest listing the per-W/ARC CDX files to be sorted.
4. Run the cdxsort hadoop job.
5. Concatenate the sorted output chunks into a single CDX file for deployment.
It is assumed you will integrate the Wayback indexing code, cdx-indexer, into your standard file ingestion workflow. That is, whatever system moves data from your crawlers into your permanent repository should be modified to also build a CDX file for each W/ARC file as it is ingested, and to store that CDX file in your HDFS. As an optimization, you can compress the per-W/ARC CDX files before storing them in HDFS. If your per-W/ARC CDX files are named with a trailing .gz suffix, the Wayback hadoop code will infer that these input files are compressed.
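For example, an ingestion step along the following lines could be added to that workflow. This is only a sketch: the exact cdx-indexer invocation, the HDFS layout, and the file names shown here are assumptions, so adapt them to your own setup.

cdx-indexer COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.gz | gzip > COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz
hadoop fs -put COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz /cdx/COLL-A/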
CDX files are large sorted text files. Hadoop can be used to perform large distributed sort operations, but to achieve an efficient total ordering across the resulting data, you need to give hadoop explicit instructions, in the form of a split file, indicating how to distribute the data within your hadoop job.
The split file is a text file in which each line indicates a partition point URL within the total possible URL space. The number of lines determines the number of chunks that will be built within hadoop, and should be based on the number of Reduce tasks you can run concurrently on your cluster.
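For illustration only, the first few lines of a split file might look like the following. These keys are hypothetical: real partition points are produced by the cdx-sample tool described below, and use the same canonicalized URL key form as the first field of your own CDX lines.

blogspot.com/somediary/2007
example.com/path/index.html
mysite.org/archive/faq.html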
If R is the number of reduce tasks you can run at the same time on your hadoop cluster, you should use (R-5) as the second argument to cdx-sample, which is distributed in the wayback .tar.gz distribution. Subtracting 5 leaves a few spare reduce workers in case of node failure, and for speculative execution in case some of your nodes are running slowly.
The more evenly the partition points divide your particular collection's URLs, the more efficiently your hadoop distributed processing will execute. It is assumed that if you are using hadoop to generate your CDX, you will already have built a sizable CDX file for your collection. The cdx-sample tool will sample an existing sorted CDX file for your collection, and produce a list of URL partitions that can be used as the split file for your hadoop processing. You should use the most recent sizable CDX built using other methods with the cdx-sample tool. If you don't have a previously built sorted CDX file for your collection, create a sample sorted CDX file from 20 or 30 random per-WARC CDX files, as described elsewhere, and use that with the cdx-sample tool.
You might use something similar to the following commands to build your split file, assuming a previously built, sorted CDX file for your collection called existing.cdx, and a total reducer capacity of 20:
cdx-sample existing.cdx 15 > split.txt
hadoop fs -put split.txt /user/brad/input-split.txt
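As an optional sanity check before moving on, you can confirm the split file has the expected number of partition points:

wc -l split.txt   # should report 15 lines for the example above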
The second input file you will need is your list of per-WARC (or per-ARC) CDX files to process.
This file can be built using the hadoop fs -ls command, and should contain one line for each CDX file you want to sort into your final CDX file.
This is an example line suitable for a manifest file:
hdfs:///cdx/COLL-A/COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz
You might use something similar to the following commands to build your manifest:
hadoop fs -ls /cdx/collectionA | perl -ane 'print "hdfs://$F[-1]\n";' | grep cdx.gz > manifest.txt
hadoop fs -put manifest.txt /user/brad/input-manifest.txt
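As an optional sanity check (this assumes the same example paths used above), the number of lines in the manifest should match the number of gzipped CDX files in HDFS:

wc -l manifest.txt
hadoop fs -ls /cdx/collectionA | grep -c cdx.gz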
Running the hadoop sort job is actually the simplest part! You just need to run:
hadoop jar PATH_TO_WAYBACK_HADOOP_JAR cdxsort -m MAPS [--compress-output] SPLIT INPUT OUTPUT_DIR
Here is an example usage:
hadoop jar /home/brad/wayback-hadoop-jar-with-dependencies.jar cdxsort -m 470 --compress-output /user/brad/input-split.txt /user/brad/input-manifest.txt /user/brad/cdx-output
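Once the job completes, you can confirm the sorted output chunks are in place by listing the output directory (the chunk file names vary with your hadoop version):

hadoop fs -ls /user/brad/cdx-output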
The previous hadoop command will create alphabetically contiguous, sorted CDX files in your HDFS output directory (OUTPUT_DIR). To produce a single CDX file that can be efficiently searched using Wayback, you need to concatenate them, in order, into one local file. For now, you have to use some shell code:
for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do hadoop fs -cat $i; done > LOCAL_FILE
If you specified the --compress-output option with your "hadoop jar ..." command, you will need to add zcat to decompress each chunk, as follows:
for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do hadoop fs -cat $i | zcat; done > LOCAL_FILE
At this point, LOCAL_FILE is ready for use as a Wayback CDX.
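Before deploying it, you may want to verify that the concatenated file really is in sorted order. This is an optional check, and assumes your CDX is sorted in plain byte order (hence the C locale):

env LC_ALL=C sort -c LOCAL_FILE && echo sorted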