OpenWayback Advanced configuration
For information on the ZipNum format, see http://aaron.blog.archive.org/2013/05/28/zipnum-and-cdx-cluster-merging/
To use a ZipNum cluster as the resource index, enable and edit CDXCollection.xml as follows:
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="canonicalizer" ref="waybackCanonicalizer" />
    <property name="source">
      <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
        <property name="cluster">
          <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
            <property name="summaryFile" value="/<PATH-TO-SUMMARYFILE>" />
            <property name="locFile" value="/<PATH-TO-LOCFILE>" />
          </bean>
        </property>
        <property name="params">
          <bean class="org.archive.format.gzip.zipnum.ZipNumParams" />
        </property>
      </bean>
    </property>
    <property name="maxRecords" value="100000" />
    <property name="dedupeRecords" value="true" />
  </bean>
</property>
Summary file format
The summary file consists of four tab-separated columns:
1. The first line of each chunk
2. Chunk name (or shard name)
3. Offset: the starting byte-offset of the chunk
4. Length: the length of the chunk
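The four columns above can be read with a few lines of code. The following is a minimal Python sketch, not part of OpenWayback itself; the names SummaryEntry and parse_summary_line are hypothetical, and the sample line (key, shard name, offset, and length) is made up for illustration. It assumes every line has exactly four columns.

```python
from typing import NamedTuple


class SummaryEntry(NamedTuple):
    """One summary line (hypothetical helper type, not an OpenWayback class)."""
    first_line: str   # column 1: the first line of the chunk
    chunk_name: str   # column 2: chunk (shard) name
    offset: int       # column 3: starting byte offset of the chunk
    length: int       # column 4: length of the chunk


def parse_summary_line(line: str) -> SummaryEntry:
    """Split one tab-separated summary line into its four columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    first_line, chunk_name, offset, length = fields
    return SummaryEntry(first_line, chunk_name, int(offset), int(length))


# Hypothetical sample line; all values are invented for illustration.
sample = "com,example)/ 20130101000000\tpart-00000\t0\t187432\n"
entry = parse_summary_line(sample)
print(entry.chunk_name, entry.offset, entry.length)
```

Splitting on tabs only (rather than on any whitespace) matters here, because the first column may itself contain spaces.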
Loc file format
The loc file consists of two tab-separated columns:
1. Chunk name (or shard name)
2. Chunk URL: e.g. hdfs://url or http://url
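As an illustration of the two-column layout, the Python sketch below loads loc lines into a dictionary that maps each chunk name to its URL. It is not OpenWayback code: parse_loc_lines is a hypothetical helper, the sample names and URLs are invented, and it assumes each chunk name appears on exactly one line.

```python
from typing import Dict, Iterable


def parse_loc_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map chunk (shard) name -> chunk URL from tab-separated loc lines."""
    locations: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {len(fields)}")
        chunk_name, chunk_url = fields
        locations[chunk_name] = chunk_url
    return locations


# Hypothetical sample content; names and URLs are invented for illustration.
sample_lines = [
    "part-00000\thttp://example.org/cdx/part-00000.gz\n",
    "part-00001\thdfs://namenode.example.org/cdx/part-00001.gz\n",
]
locations = parse_loc_lines(sample_lines)
print(locations["part-00001"])
```

The chunk name is the join key between the two files: column 2 of a summary line selects the URL here, and the summary line's offset and length then identify the byte range of that chunk.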
For more information on how to generate the summary file using Hadoop, see the link at the top.