OpenWayback - Resource Index Configuration

ResourceIndex configuration options

Overview

A ResourceIndex locates documents within a WaybackCollection through a single method:

  public SearchResults query(final WaybackRequest request)
    throws ResourceIndexNotAvailableException,
    ResourceNotInArchiveException, BadQueryException,
    AccessControlException;

The ResourceIndex is responsible for deciding which SearchResults subclass, CaptureSearchResults or UrlSearchResults, is appropriate for the WaybackRequest argument, and for populating the returned SearchResults object with matching records.

When the request indicates the user wishes to find specific captures of a single URL, a CaptureSearchResults object should be returned. When the request may return results for multiple URLs, for example a query attempting to locate all URLs beginning with a given prefix within the WaybackCollection, a UrlSearchResults object should be returned.
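The choice between the two result types can be sketched as follows. This is only an illustration: the nested classes and the QueryKind enum are simplified, hypothetical stand-ins, not the real Wayback classes, and a real ResourceIndex would inspect the WaybackRequest itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the real Wayback classes.
final class ResultTypeSketch {
    // Assumed classification of a request; the real WaybackRequest API may differ.
    enum QueryKind { SINGLE_URL_CAPTURES, URL_PREFIX }

    static abstract class SearchResults {
        final List<String> records = new ArrayList<>();
    }
    static final class CaptureSearchResults extends SearchResults {}
    static final class UrlSearchResults extends SearchResults {}

    // Pick the result container that matches the kind of query.
    static SearchResults emptyResultsFor(QueryKind kind) {
        switch (kind) {
            case SINGLE_URL_CAPTURES:
                return new CaptureSearchResults(); // captures of one URL
            case URL_PREFIX:
                return new UrlSearchResults();     // possibly many URLs
            default:
                throw new IllegalArgumentException("unknown query kind: " + kind);
        }
    }
}
```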

LocalResourceIndex configuration options

This ResourceIndex implementation assumes a local database of all documents within the WaybackCollection. The type of database is specified with the source property.

The following configuration is required for a LocalResourceIndex:

source - a bean implementing SearchResultSource, which can be one of the following:
- BDBIndex - a Berkeley DB Java Edition (BDB JE) database holding records for all documents within the WaybackCollection. This implementation allows for fast incremental updates to the index, and is required when using automatic indexing. This implementation scales well to tens of millions of records.
- CDXIndex - a sorted flat file containing one line per document within the WaybackCollection. This implementation requires that the CDX file be manually maintained, but scales to very large sizes, limited primarily by the size of file you can build and store. CDX files can be built using the command line tool arc-indexer or warc-indexer, and the standard UNIX sort tool.
- CompositeSearchResultSource - an implementation allowing aggregation of multiple SearchResultSources into a single logical SearchResultSource. Use of BDBIndex SearchResultSources within this class is experimental, but this implementation has been used successfully in production installations to serve results from several CDXIndex files. For optimal search efficiency, multiple index files should be merged (sort -mu) prior to production use, but this implementation allows a trade-off in simplified index management for a decrease in search performance. A useful strategy for managing large scale collections is to use several CDX files of increasing size. Updates to the set of CDX files are always performed against the smallest CDX file, and occasionally this small file is merged with one of the larger files, minimizing the amount of data that needs to be read, sorted, and written back to disk to update the set of CDX files.
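The merge step in the multi-file strategy above (the equivalent of sort -mu on two files) can be sketched in Java as below. This is a minimal in-memory illustration, not the tooling Wayback ships: the class and method names are hypothetical, and it assumes both inputs are already sorted, contain ASCII-only lines, and were sorted in plain byte order so that String.compareTo agrees with the sort tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Wayback.
final class CdxMergeSketch {
    /** Merges two already-sorted lists of lines, dropping exact duplicate lines. */
    static List<String> mergeUnique(List<String> small, List<String> large) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < small.size() || j < large.size()) {
            String next;
            // Take from 'small' when 'large' is exhausted or its head is not smaller.
            if (j >= large.size()
                    || (i < small.size() && small.get(i).compareTo(large.get(j)) <= 0)) {
                next = small.get(i++);
            } else {
                next = large.get(j++);
            }
            // Inputs are sorted, so duplicates are always adjacent in the output.
            if (out.isEmpty() || !out.get(out.size() - 1).equals(next)) {
                out.add(next);
            }
        }
        return out;
    }
}
```

Because each input is read once from start to end, folding the small file into a larger one costs time proportional to their combined size, which is why updates are cheapest when applied to the smallest file.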

The following configurations are optional for LocalResourceIndexes:

maxRecords - integer maximum number of records to process for a single request. Useful to prevent a single request from consuming too many disk and CPU resources.
dedupeRecords - boolean value that should be set to true when using deduplicated WARC records. This causes Wayback to modify search results as they are read from the index, so records indicating a resource was inspected but not saved are accessible within the Wayback. Please see the Duplicate Reduction section below for more information.
annotater - experimental hook for modifying or omitting records as they are read from the index. For example, additional metadata could be associated with each record from an external datasource, and this extra metadata could then be exposed to end users through a .jsp customization.
canonicalizer - an implementation of UrlCanonicalizer. See the section labeled URL Canonicalization below for more information.
filter - an implementation of ObjectFilter<CaptureSearchResult> which will remove records at query time from the index.
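The interaction of maxRecords and filter during a query can be pictured with the sketch below. It is an assumption-laden illustration: java.util.function.Predicate stands in for ObjectFilter, records are plain strings, and the exact point at which Wayback counts a record as "processed" is not specified here and may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; Predicate is a stand-in for Wayback's ObjectFilter.
final class QueryLimitSketch {
    /** Scans index records in order, keeping those accepted by the filter. */
    static List<String> scan(Iterable<String> indexRecords,
                             Predicate<String> keep,
                             int maxRecords) {
        List<String> results = new ArrayList<>();
        int processed = 0;
        for (String record : indexRecords) {
            if (processed >= maxRecords) {
                break; // cap the work done for a single request
            }
            processed++;
            if (keep.test(record)) {
                results.add(record); // rejected records never reach the caller
            }
        }
        return results;
    }
}
```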

For specific Spring configuration examples of these ResourceIndex options, please refer to the following files distributed within the wayback .war file:

RemoteResourceIndex configuration options

This ResourceIndex implementation forwards index requests to an external Wayback installation. It can be useful for distributed installations, as well as for experimenting with new Wayback configurations and installations using an existing ResourceIndex. For example, a development system can be configured to use a production index remotely, minimizing the setup needed to test new configurations.

The actual index must be stored on another Wayback installation, and is requested as XML through this implementation.

The following configuration is required for a RemoteResourceIndex:

searchUrlBase - the URL prefix indicating the AccessPoint actually holding the ResourceIndex.

The following configurations are optional for RemoteResourceIndexes:

canonicalizer - an implementation of UrlCanonicalizer. See the section labeled URL Canonicalization below for more information.

For a Spring configuration example of this ResourceIndex option, please refer to the following file distributed within the wayback .war file:

RemoteCollection.xml

URL Canonicalization

Introduction and Concepts

Sometimes URLs found in the field can have multiple forms, for example:

            http://www.example.com/img/foo.gif
            http://www.example.com/docs/../img/foo.gif

are both valid representations of the exact same URL. Another, less certain example would be:

            http://www.example.com/Interview.html
            http://www.example.com/interview.html

which differ only in the capitalization of the letter "i". On some operating systems, these two URLs legitimately specify two distinct documents. On Windows platforms, they refer to the same document. If the document on a web server is actually named "Interview.html", but a web designer creates a web page that refers to this document using the lowercase "interview.html", then the link will work, and they and the web site visitors may never notice the difference. The same situation on a different operating system would probably not work (although some web server plugins and modules will also correct this problem transparently) and the web designer would probably notice and correct the problem. In practice, we have found that it is very rare for the two URLs above with different capitalization to refer to different documents, and they can be treated as equivalent in most situations.

Another example, which occurs far more often in the real world, involves web servers injecting a session ID inside paths to documents hosted on that web server. These session IDs allow the web server to track the state of individual users. Here are some example URLs demonstrating path session ID injection:

            http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx
            http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page2.aspx
            http://www.example.com/(S(a63098d96360a63098d96360))/page3.aspx

In these examples, the first two URLs are using one session ID, and the third uses a different session ID. If page3.aspx refers to page1.aspx using an anchor like this:

            <a href="page1.aspx">page1</a>

and a user visiting page3.aspx clicks the link to page1, then the wayback will receive a request for the URL:

            http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx

If page1.aspx was captured using the different session ID, then the wayback will be unable to locate this document in the index, even though it was captured.

This session ID problem can be mitigated by canonicalizing the URLs as they are placed in the index, so the index would contain the following URLs, instead of the original form, which the crawler captured:

            http://www.example.com/page1.aspx
            http://www.example.com/page2.aspx
            http://www.example.com/page3.aspx

If the same canonicalization scheme is used to transform incoming requests before attempting to look up URLs in the index, then the software is able to locate and return the documents correctly.
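A minimal sketch of this idea, covering only the (S(...)) path pattern shown in the examples above, might look like the following. The class and method names are hypothetical and the regular expression is an assumption about the session ID shape; it is not the pattern Wayback itself uses.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of applying one rule at both index and request time.
final class SessionIdSketch {
    // Matches a path segment of the form /(S(<alphanumerics>)) followed by another '/'.
    private static final Pattern SESSION_SEGMENT =
            Pattern.compile("/\\(S\\([0-9a-zA-Z]+\\)\\)(?=/)");

    /** Removes the session ID path segment, leaving the rest of the URL untouched. */
    static String stripSessionId(String url) {
        return SESSION_SEGMENT.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // The captured URL and the requested URL carry different session IDs...
        String captured = "http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx";
        String requested = "http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx";
        // ...but both reduce to the same lookup key.
        System.out.println(stripSessionId(captured).equals(stripSessionId(requested)));
    }
}
```

The key point is that the same function is applied when the index is built and when a request arrives, so both sides agree on the lookup key.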

Current Status within Wayback

The Wayback currently includes only a single reference implementation of a canonicalization scheme, called AggressiveUrlCanonicalizer. This implementation provides the following canonicalizations:

- www# removal - http://www.example.com => example.com, http://www13.example.com => example.com
- user info removal - http://user@example.com => example.com, http://user:password@example.com => example.com
- default port removal - http://example.com:80 => example.com
- session ID removal (and other common session ID path injection schemes) - http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx => example.com/page1.aspx
- path and CGI argument lowercasing - http://www.example.com/Interviews.cgi?Interview=Left => example.com/interviews.cgi?interview=left
- extra query argument delimiter removal - http://www.example.com/Interviews.cgi?Interview=Left& => example.com/interviews.cgi?interview=left
- unneeded query specifier removal - http://www.example.com/Interviews.cgi? => example.com/interviews.cgi

These heuristics generally correct many common URL lookup problems, but in some cases these operations do the wrong thing, typically by making content which is actually different appear to be the same thing.
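To make the heuristics concrete, here is a rough Java sketch that reproduces only the example transformations listed above. It is not the AggressiveUrlCanonicalizer source: the class name, the regular expressions, and the order of the steps are all assumptions, and it only treats port 80 as a default port.

```java
import java.util.Locale;

// Hypothetical sketch; reproduces the listed examples, not the real implementation.
final class CanonicalizeSketch {
    static String canonicalize(String url) {
        // Drop the scheme ("http://", "https://", ...).
        String s = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");

        // Split the authority (host part) from the path and query.
        int cut = s.length();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '/' || c == '?') { cut = i; break; }
        }
        String authority = s.substring(0, cut);
        String rest = s.substring(cut);

        // User info removal: keep only what follows the last '@'.
        int at = authority.lastIndexOf('@');
        if (at >= 0) authority = authority.substring(at + 1);
        authority = authority.toLowerCase(Locale.ROOT);

        // Default port removal (assumption: only :80 is handled here).
        authority = authority.replaceFirst(":80$", "");
        // www# removal: "www." or "www13." at the start of the host.
        authority = authority.replaceFirst("^www\\d*\\.", "");

        // Session ID removal for the (S(...)) path form only.
        rest = rest.replaceAll("/\\(S\\([0-9a-zA-Z]+\\)\\)(?=/)", "");
        // Path and CGI argument lowercasing.
        rest = rest.toLowerCase(Locale.ROOT);
        // Remove a trailing '&' delimiter or an empty '?' query specifier.
        rest = rest.replaceFirst("[&?]+$", "");

        return authority + rest;
    }
}
```

The lowercasing step shows how such heuristics can over-merge: two URLs that differ only in case and really do name different documents would collapse to the same key.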

At the IA, we have recently switched to building CDX files using the -identity option on the arc-indexer and warc-indexer tools. The -identity option requires passing records through the url-client tool before sorting and merging into production CDX files. By keeping the original "identity" CDX files, we have been able to test various URL canonicalization strategies without the overhead of re-processing all the ARC/WARC source materials.

Future Directions within Wayback

In upcoming wayback releases, we intend to provide more canonicalization implementations, including a configurable implementation that will allow broad customization capabilities.

We also intend to alter the format of wayback indexes significantly. Using this new format will be optional, but once indexes are created in the new format, other indexes with different canonicalization strategies can be built from them without requiring a complete reindex of the original ARC/WARC content.

The new format will also allow a degree of dynamic canonicalization at run-time, meaning different strategies can be tested using the same indexes, and site-specific canonicalization strategies may be possible.

We anticipate that allowing (advanced) users to easily change between canonicalization strategies within the same wayback session will promote better community understanding of the impacts of different strategies, and will enable the community to build a set of best practices for URL canonicalization.

Duplicate Reduction

Heritrix 1.12 and above can write WARC files that omit documents which have not changed since a previous visit. For specifics on activating these features, please refer to the Heritrix documentation. When Heritrix is using these features and notices that a document has not changed since the last time it was visited, it creates an abbreviated WARC record indicating that the document was retrieved but not stored. This abbreviated WARC record includes the SHA1 digest of the document.

The wayback uses these identical SHA1 digests to map the location (ARC/WARC + offset) of the original record that was stored to subsequent records that were not. When a request for a subsequent capture that was not stored is received by wayback, it will return the content of the previous stored record.

The matching of these digests occurs at query time, and is configured by setting the "dedupeRecords" option of the LocalResourceIndex to "true".
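The query-time matching can be sketched as follows. This is an illustration under stated assumptions, not Wayback's code: the Capture class and its fields are hypothetical, records are assumed to arrive in capture order for a single URL, and a record with no file is taken to mean "retrieved but not stored".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of digest-based duplicate resolution.
final class DedupeSketch {
    static final class Capture {
        final String timestamp;
        final String digest;
        final String file;   // ARC/WARC file name, or null if content was not stored
        final long offset;   // offset within the file; ignored when file is null

        Capture(String timestamp, String digest, String file, long offset) {
            this.timestamp = timestamp;
            this.digest = digest;
            this.file = file;
            this.offset = offset;
        }

        boolean isStored() { return file != null; }
    }

    /** Points each unstored capture at the most recent earlier stored capture with the same digest. */
    static List<Capture> resolve(List<Capture> inCaptureOrder) {
        Map<String, Capture> storedByDigest = new HashMap<>();
        List<Capture> out = new ArrayList<>();
        for (Capture c : inCaptureOrder) {
            if (c.isStored()) {
                storedByDigest.put(c.digest, c);
                out.add(c);
            } else {
                Capture original = storedByDigest.get(c.digest);
                // Keep the later timestamp but borrow the stored record's location.
                out.add(original == null
                        ? c
                        : new Capture(c.timestamp, c.digest, original.file, original.offset));
            }
        }
        return out;
    }
}
```

An unstored capture with no earlier match is passed through unchanged in this sketch; how Wayback treats that case is not covered here.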