OpenWayback Advanced configuration
For information on the ZipNum format, see http://aaron.blog.archive.org/2013/05/28/zipnum-and-cdx-cluster-merging/
Enable and edit CDXCollection.xml as follows:
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="canonicalizer" ref="waybackCanonicalizer" />
    <property name="source">
      <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
        <property name="cluster">
          <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
            <property name="summaryFile" value="/<PATH-TO-SUMMARYFILE>"/>
            <property name="locFile" value="/<PATH-TO-LOCFILE>" />
          </bean>
        </property>
        <property name="params">
          <bean class="org.archive.format.gzip.zipnum.ZipNumParams"/>
        </property>
      </bean>
    </property>
    <property name="maxRecords" value="100000" />
    <property name="dedupeRecords" value="true" />
  </bean>
</property>
Summary file format
The summary file has one line per chunk, with 4 tab-separated columns:
1. The first line of each chunk
2. Chunk name (or shard name)
3. Offset: the starting byte-offset of the chunk
4. Length: the length of the chunk
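To make the four-column layout concrete, here is a minimal Python sketch that parses summary lines and picks the chunk that could contain a given key. The names `SummaryEntry`, `parse_summary_line` and `find_chunk` are hypothetical and not part of OpenWayback, and the lookup assumes the summary file is sorted by its first column, as a sorted CDX index would be. It illustrates the format only and is not how OpenWayback itself reads the file.

```python
import bisect
from typing import List, NamedTuple


class SummaryEntry(NamedTuple):
    first_line: str  # column 1: first line of the chunk
    chunk: str       # column 2: chunk (shard) name
    offset: int      # column 3: starting byte offset of the chunk
    length: int      # column 4: length of the chunk


def parse_summary_line(line: str) -> SummaryEntry:
    """Split one tab-separated summary line into its four columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    first_line, chunk, offset, length = fields[:4]
    return SummaryEntry(first_line, chunk, int(offset), int(length))


def find_chunk(entries: List[SummaryEntry], key: str) -> SummaryEntry:
    """Return the last entry whose first line sorts at or before `key`.

    Assumption: `entries` is sorted by `first_line`. Python compares strings
    by code point, which matches byte order for ASCII keys.
    """
    keys = [entry.first_line for entry in entries]
    index = bisect.bisect_right(keys, key) - 1
    return entries[max(index, 0)]
```

Once an entry is found, its chunk name, offset and length identify the byte range to read from that chunk.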
Loc file format
The loc file has one line per chunk, with 2 tab-separated columns:
1. Chunk name (or shard name)
2. Chunk URL: e.g. hdfs://url or http://url
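As an illustration of the two-column layout, the following Python sketch loads loc lines into a dictionary that maps each chunk name to its URL. The function name `parse_loc_lines`, the chunk names and the URLs in the usage example are all hypothetical. It assumes exactly the two columns described above and ignores anything after them.

```python
from typing import Dict, Iterable


def parse_loc_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map chunk (shard) name -> chunk URL from tab-separated loc lines."""
    locations: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected 2 tab-separated columns: {line!r}")
        locations[fields[0]] = fields[1]
    return locations


# Usage with made-up chunk names and URLs:
example = [
    "part-00000\thdfs://namenode.example/cdx/part-00000.gz\n",
    "part-00001\thttp://cdx.example/part-00001.gz\n",
]
locations = parse_loc_lines(example)
print(locations["part-00001"])
```

The chunk name in column 1 is the same name used in column 2 of the summary file, which is how a summary entry is resolved to a location.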
For more information on how to generate the summary file using Hadoop, see the link at the top of this page.