Wayback is distributed with a .jar file that simplifies creation of large-scale CDX files using hadoop. This code is experimental, and will primarily be useful only if your CDX files are very large - more than a few hundred GB (the exact threshold depends on your hardware). If building or updating your CDX files is the largest problem with your installation, this may help. At IA, we've used this framework to build and deploy CDX files of more than 700GB, containing billions of records, on a 24 node cluster in about 8 hours from start to finish. Simply writing a 700GB file to disk at 50MB/sec takes around 4 hours, so the final deployment step alone accounts for around half that time.
Using hadoop to generate your CDX files requires the following high-level process:

1. Build a CDX file for each W/ARC file as it is ingested, and store those per-W/ARC CDX files in your HDFS.
2. Build a split file that tells hadoop how to partition the URL space across its sorted output chunks.
3. Build a manifest listing the per-W/ARC CDX files to be sorted.
4. Run the cdxsort hadoop job.
5. Concatenate the sorted output chunks into a single CDX file for deployment.
It is assumed you will integrate the Wayback indexing code, cdx-indexer, into your standard file ingestion workflow. That is, whatever system moves data from your crawlers into your permanent repository should be modified to also build a CDX file for each W/ARC file as it is ingested, and to store that CDX file in your HDFS. As an optimization, you can compress the per-W/ARC CDX files before storing them in HDFS. If your per-W/ARC CDX files are named with a trailing .gz suffix, the Wayback hadoop code will infer that these input files are compressed.
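For example, an ingestion step along the following lines could be added to that workflow. This is only a sketch: the exact cdx-indexer invocation, the HDFS layout, and the file names shown here are assumptions, so adapt them to your own setup.

cdx-indexer COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.gz | gzip > COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz
hadoop fs -put COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz /cdx/COLL-A/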
CDX files are large sorted text files. Hadoop can be used to perform large distributed sort operations, but to achieve an efficient total ordering across the resulting data, you need to give hadoop explicit instructions, in the form of a split file, indicating how to distribute the data within your hadoop job.
The split file is a text file in which each line indicates a partition point URL within the total possible URL space. The number of lines determines the number of chunks that will be built within hadoop, and should be based on the number of Reduce tasks you can run concurrently on your cluster.
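For illustration only, the first few lines of a split file might look like the following. These keys are hypothetical: real partition points are produced by the cdx-sample tool described below, and use the same canonicalized URL key form as the first field of your own CDX lines.

blogspot.com/somediary/2007
example.com/path/index.html
mysite.org/archive/faq.html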
If R is the number of reduce tasks you can run at the same time on your hadoop cluster, you should use (R-5) as the second argument to cdx-sample, which is distributed in the wayback .tar.gz distribution. Subtracting 5 leaves a few spare reduce workers in case of node failure, and for speculative execution in case some of your nodes are running slowly.
The more evenly the partition points divide your particular collection's URLs, the more efficiently your hadoop distributed processing will execute. It is assumed that if you are using hadoop to generate your CDX, you will already have built a sizable CDX file for your collection. The cdx-sample tool will sample an existing sorted CDX file for your collection, and produce a list of URL partitions that can be used as the split file for your hadoop processing. You should use the most recent sizable CDX built using other methods with the cdx-sample tool. If you don't have a previously built sorted CDX file for your collection, create a sample sorted CDX file from 20 or 30 random per-WARC CDX files, as described elsewhere, and use that with the cdx-sample tool.
You might use something similar to the following commands to build your split file, assuming a previously built, sorted CDX file for your collection called existing.cdx, and a total reducer capacity of 20:
cdx-sample existing.cdx 15 > split.txt
hadoop fs -put split.txt /user/brad/input-split.txt
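As an optional sanity check before moving on, you can confirm the split file has the expected number of partition points:

wc -l split.txt   # should report 15 lines for the example above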
The second input file you will need is your list of per-WARC (or per-ARC) CDX files to process.
This file can be built using the hadoop fs -ls command, and should contain one line for each CDX file you want to sort into your final CDX file.
This is an example line suitable for a manifest file:
hdfs:///cdx/COLL-A/COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz
You might use something similar to the following commands to build your manifest:
hadoop fs -ls /cdx/collectionA | perl -ane 'print "hdfs://$F[-1]\n";' | grep cdx.gz > manifest.txt
hadoop fs -put manifest.txt /user/brad/input-manifest.txt
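As an optional sanity check (this assumes the same example paths used above), the number of lines in the manifest should match the number of gzipped CDX files in HDFS:

wc -l manifest.txt
hadoop fs -ls /cdx/collectionA | grep -c cdx.gz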
Running the hadoop sort job is actually the simplest part! You just need to run:
hadoop jar PATH_TO_WAYBACK_HADOOP_JAR cdxsort -m MAPS [--compress-output] SPLIT INPUT OUTPUT_DIR
Here is an example usage:
hadoop jar /home/brad/wayback-hadoop-jar-with-dependencies.jar cdxsort -m 470 --compress-output /user/brad/input-split.txt /user/brad/input-manifest.txt /user/brad/cdx-output
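Once the job completes, you can confirm the sorted output chunks are in place by listing the output directory (the chunk file names vary with your hadoop version):

hadoop fs -ls /user/brad/cdx-output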
The previous hadoop command will create alphabetically contiguous, sorted CDX files in your HDFS output directory (OUTPUT_DIR). To produce a single CDX file that can be efficiently searched using Wayback, you need to concatenate them, in order, into one local file. For now, you have to use some shell code:
for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do hadoop fs -cat $i; done > LOCAL_FILE
If you specified the --compress-output option with your "hadoop jar ..." command, you will need to add zcat to decompress each chunk, as follows:
for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do hadoop fs -cat $i | zcat; done > LOCAL_FILE
At this point, LOCAL_FILE is ready for use as a Wayback CDX.
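Before deploying it, you may want to verify that the concatenated file really is in sorted order. This is an optional check, and assumes your CDX is sorted in plain byte order (hence the C locale):

env LC_ALL=C sort -c LOCAL_FILE && echo sorted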