A ResourceIndex locates documents within a WaybackCollection through a single method:
public SearchResults query(final WaybackRequest request) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, AccessControlException;
When the request indicates the user wishes to find specific captures of a single URL, CaptureSearchResults should be returned. When the request may return results for multiple URLs, for example a query attempting to locate all URLs beginning with a given prefix within the WaybackCollection, a URLSearchResults object should be returned.
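The distinction between the two result types can be sketched as follows. This is a minimal illustration only: `QueryResultSketch`, `CaptureResults`, `UrlResults`, and the `{url, timestamp}` index entries are hypothetical, simplified stand-ins and not the Wayback classes, and the sketch returns empty results rather than throwing the exceptions declared on the real method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified stand-ins for illustration; not the Wayback classes.
class QueryResultSketch {
    interface Results {}

    // Captures (timestamps) of exactly one URL.
    static final class CaptureResults implements Results {
        final String url;
        final List<String> timestamps;

        CaptureResults(String url, List<String> timestamps) {
            this.url = url;
            this.timestamps = timestamps;
        }
    }

    // Distinct URLs matching a prefix, in sorted order.
    static final class UrlResults implements Results {
        final List<String> urls;

        UrlResults(List<String> urls) {
            this.urls = urls;
        }
    }

    // Each index entry is a two-element array: {url, timestamp}.
    static Results query(List<String[]> index, String key, boolean prefixQuery) {
        if (prefixQuery) {
            // A prefix query may match many URLs: return the distinct URLs.
            TreeSet<String> urls = new TreeSet<>();
            for (String[] entry : index) {
                if (entry[0].startsWith(key)) {
                    urls.add(entry[0]);
                }
            }
            return new UrlResults(new ArrayList<>(urls));
        }
        // A single-URL query: return every capture of that URL.
        List<String> timestamps = new ArrayList<>();
        for (String[] entry : index) {
            if (entry[0].equals(key)) {
                timestamps.add(entry[1]);
            }
        }
        return new CaptureResults(key, timestamps);
    }
}
```

A caller would then branch on the concrete type it receives, for example to render a list of capture dates for one URL or a list of matching URLs for a prefix.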
This ResourceIndex implementation assumes a local database of all documents within the WaybackCollection. The type of database is specified with the source property.
The following configuration is required for a LocalResourceIndex:
The following configurations are optional for LocalResourceIndexes:
For specific Spring configuration examples of these ResourceIndex options, please refer to the following files distributed within the wayback .war file:
This ResourceIndex implementation requests an external Wayback installation to satisfy index requests, and can be useful for distributed installations, as well as for experimenting with new Wayback configurations and installations using an existing ResourceIndex. For example, a development system can be configured to use a production index remotely, minimizing the setup needed to test new configurations.
The actual index must be stored on another Wayback installation, and is requested as XML through this implementation.
The following configuration is required for a RemoteResourceIndex:
The following configurations are optional for RemoteResourceIndexes:
For a Spring configuration example of this ResourceIndex option, please refer to the following files distributed within the wayback .war file:
Sometimes URLs found in the field can have multiple forms, for example:
http://www.example.com/img/foo.gif
http://www.example.com/docs/../img/foo.gif
http://www.example.com/Interview.html
http://www.example.com/interview.html
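As a rough sketch of how such forms could be collapsed onto one index key, the following uses the standard `java.net.URI.normalize()` to resolve `..` path segments and then lowercases the result. The class and method names (`UrlFormSketch`, `canonicalKey`) are hypothetical, and this is not the canonicalizer shipped with Wayback. Lowercasing is an assumption that trades accuracy for recall: on a case-sensitive server, two URLs differing only in case may be different documents.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Wayback.
class UrlFormSketch {
    // Resolve "." and ".." path segments, then lowercase the whole URL.
    static String canonicalKey(String url) {
        String normalized = URI.create(url).normalize().toString();
        return normalized.toLowerCase(Locale.ROOT);
    }
}
```

With this helper, each pair of example URLs above maps to the same key, so a lookup for either form would find captures recorded under the other.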
Another example, which occurs far more often in the real world, involves web servers injecting a session ID into the paths of documents hosted on that web server. These session IDs allow the web server to track each user's state. Here are some example URLs demonstrating path session ID injection:
http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx
http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page2.aspx
http://www.example.com/(S(a63098d96360a63098d96360))/page3.aspx
<a href="page1.aspx">page1</a>
http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
This session ID problem can be mitigated by canonicalizing the URLs as they are placed in the index, so the index would contain the following URLs, instead of the original form, which the crawler captured:
http://www.example.com/page1.aspx
http://www.example.com/page2.aspx
http://www.example.com/page3.aspx
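A minimal sketch of this kind of session ID removal, assuming the IDs always appear as a path segment of the form `/(S(<alphanumeric id>))` as in the examples above. The class name `SessionIdSketch` and its regular expression are illustrative assumptions, not the rules used by any canonicalizer shipped with Wayback.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Wayback.
class SessionIdSketch {
    // Matches a path segment of the form "/(S(<alphanumeric id>))".
    private static final Pattern SESSION_SEGMENT =
            Pattern.compile("/\\(S\\([0-9A-Za-z]+\\)\\)");

    // Remove every session ID segment, leaving the rest of the URL intact.
    static String stripSessionId(String url) {
        return SESSION_SEGMENT.matcher(url).replaceAll("");
    }
}
```

The same function has to be applied both when records are written to the index and when an incoming request URL is looked up, otherwise the two sides will not agree on the key.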
Wayback currently includes a single reference implementation of a canonicalization scheme, called AggressiveUrlCanonicalizer. This implementation provides the following canonicalization:
At the IA, we have recently switched to building CDX files using the -identity option on the arc-indexer and warc-indexer tools. The -identity option requires passing records through the url-client tool before sorting and merging into production CDX files. By keeping the original "identity" CDX files, we have been able to test various URL canonicalization strategies without the overhead of re-processing all the ARC/WARC source materials.
In upcoming wayback releases, we intend to provide more canonicalization implementations, including a configurable implementation that will allow broad customization capabilities.
We also intend to alter the format of wayback indexes significantly. Using this new format will be optional, but once indexes have been created in the new format, other indexes with different canonicalization strategies can be built from them without requiring a complete reindex of the original ARC/WARC content.
The new format will also allow a degree of dynamic canonicalization at run-time, meaning different strategies can be tested using the same indexes, and site-specific canonicalization strategies may be possible.
We anticipate that allowing (advanced) users to easily change between canonicalization strategies within the same wayback session will promote better community understanding of the impacts of different strategies, and will enable the community to build a set of best practices for URL canonicalization.
Heritrix 1.12 and above can write WARC files that omit documents that have not changed since a previous visit. For specifics on activating these features, please refer to the Heritrix documentation. When Heritrix is using these features and notices that a document has not changed since the last time it was visited, it creates an abbreviated WARC record indicating that the document was retrieved but not stored. The abbreviated WARC record includes the SHA1 digest of the document.
The wayback uses these identical SHA1 digests to map the location (ARC/WARC + offset) of the original record that was stored to subsequent records that were not. When a request for a subsequent capture that was not stored is received by wayback, it will return the content of the previous stored record.
The matching of these digests occurs at query time, and is configured by setting the "dedupeRecords" option of the LocalResourceIndex to "true".
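The digest matching described above can be sketched roughly as follows. The `Capture` class, its fields, and the convention that a `null` file marks a capture whose content was not stored are all assumptions made for illustration; this is not the Wayback implementation. The sketch expects the captures of one URL sorted by timestamp, oldest first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of query-time digest matching; not the Wayback code.
class DedupeSketch {
    static final class Capture {
        final String timestamp;
        final String digest;
        final String file;  // ARC/WARC name, or null when content was not stored
        final long offset;  // meaningful only when file != null

        Capture(String timestamp, String digest, String file, long offset) {
            this.timestamp = timestamp;
            this.digest = digest;
            this.file = file;
            this.offset = offset;
        }

        boolean isStored() {
            return file != null;
        }
    }

    // Input must hold the captures of one URL, sorted by timestamp ascending.
    static List<Capture> resolve(List<Capture> sorted) {
        Map<String, Capture> lastStored = new HashMap<>();
        List<Capture> out = new ArrayList<>();
        for (Capture c : sorted) {
            if (c.isStored()) {
                // Remember where content with this digest lives.
                lastStored.put(c.digest, c);
                out.add(c);
            } else {
                // Point the unstored capture at the earlier stored record,
                // keeping its own timestamp. Leave it unchanged if no
                // stored record with the same digest has been seen.
                Capture original = lastStored.get(c.digest);
                out.add(original == null
                        ? c
                        : new Capture(c.timestamp, c.digest, original.file, original.offset));
            }
        }
        return out;
    }
}
```

Because the mapping is computed while results are assembled, the index itself can keep the abbreviated records as they were written and no re-indexing is needed to enable or disable the behavior.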