Requirements

Third Party Packages

Please see the System Requirements.

Wayback Software

Please see the Software Downloads page.

Installing

Installing Tomcat

Please refer to the README file included with your Tomcat distribution.

Installing Wayback

Once you have downloaded the .tar.gz file from SourceForge, unpack it to access the webapp file, wayback-webapp-1.6.0.war.

Installation and configuration of this software involves the following steps:

  1. Placing the .war file in the appropriate location.
  2. Waiting for Tomcat to unpack the .war file.
  3. Customizing the base wayback.xml and possibly other XML configuration files.
  4. Restarting Tomcat.

Wayback Configuration Overview

The wayback software provides Query and Replay access to archived documents. Query access allows users to locate particular documents within the collection by URL and date. Replay access allows users to view archived pages within their web browsers. Some Replay modes require altering the original pages and resources, so embedded and referenced content is also loaded from the Wayback service, and not from the live web.

A WaybackCollection defines a set of archived documents and an index which allows documents to be quickly located within the collection. A WaybackCollection may be exposed to end users through one or more AccessPoints, which define:

  • the WaybackCollection itself
  • the URL where users can access the collection
  • how query results are presented to users (the Query UI)
  • how documents are returned to users so they appear correctly in their web browsers (the Replay UI)
  • the look and feel of the wayback user interface
  • who can access the documents in the collection
  • which documents from the collection are available

Wayback is configured using Spring IoC to specify and configure concrete implementations of several basic modules. Please see the Spring website for more information on configuring beans using Spring XML.

AccessPoint configuration options

An AccessPoint's configuration must specify the following implementations:

  • collection - the specific WaybackCollection being exposed via this AccessPoint.
  • query - responsible for generating user-visible content (HTML, XML, etc.) in response to user Queries.
  • replay - responsible for determining the appropriate ReplayRenderer implementation based on the user's request and the particular document to be Replayed.
  • uriConverter - responsible for constructing Replay URLs from records matching users' queries. See Replay Modes below.
  • parser - responsible for translating incoming requests into WaybackRequests. See Replay Modes below.

An AccessPoint's configuration may optionally specify the following, but must specify at least one of replayPrefix, queryPrefix, or staticPrefix:

  • exception - an implementation responsible for generating error pages to users
  • configs - a Properties associating arbitrary key-value pairs which are accessible to .jsp files responsible for generating the UI
  • exclusionFactory - an implementation specifying what documents should be accessible within this AccessPoint
  • authentication - an implementation specifying who is allowed to connect to this AccessPoint
  • replayPrefix - a String URL prefix indicating the host, port, and path to the correct Replay AccessPoint. If unspecified, defaults to queryPrefix, then staticPrefix.
  • queryPrefix - a String URL prefix indicating the host, port, and path to the correct Query AccessPoint. If unspecified, defaults to staticPrefix, then replayPrefix.
  • staticPrefix - a String URL prefix indicating the host, port, and path to static content used within the UI. If unspecified, defaults to queryPrefix, then replayPrefix.
  • livewebPrefix - a String URL prefix indicating the host, port, and path to an AccessPoint configured with Live Web fetching.
  • locale - a specific Locale to use for all requests within this AccessPoint, overriding the user's preferred Locale as specified by their web browser.
  • exactHostMatch - true or false. If true, only returns results exactly matching a given request hostname (case insensitive). Default is false.
  • exactSchemeMatch - true or false. If true, only returns results exactly matching a given request scheme. Default is true.
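
The fallback order among replayPrefix, queryPrefix, and staticPrefix can be pictured with a small sketch. This is not Wayback's own code: the class PrefixDefaults and its methods are hypothetical names that only mirror the defaults described in the list above, and the example URL is a placeholder.

```java
// Hypothetical helper that mirrors the documented fallback order; not Wayback's own code.
class PrefixDefaults {
    // Returns the first candidate that is configured (non-null), or null if none is.
    private static String firstConfigured(String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null) {
                return candidate;
            }
        }
        return null;
    }

    // replayPrefix falls back to queryPrefix, then staticPrefix.
    static String replayPrefix(String replay, String query, String staticContent) {
        return firstConfigured(replay, query, staticContent);
    }

    // queryPrefix falls back to staticPrefix, then replayPrefix.
    static String queryPrefix(String replay, String query, String staticContent) {
        return firstConfigured(query, staticContent, replay);
    }

    // staticPrefix falls back to queryPrefix, then replayPrefix.
    static String staticPrefix(String replay, String query, String staticContent) {
        return firstConfigured(staticContent, query, replay);
    }

    public static void main(String[] args) {
        // Only queryPrefix is configured here, so all three resolve to the same value.
        String query = "http://wayback.example.org:8080/wayback/";
        System.out.println(replayPrefix(null, query, null));
        System.out.println(queryPrefix(null, query, null));
        System.out.println(staticPrefix(null, query, null));
    }
}
```

In this sketch, leaving all three prefixes unset yields null, which corresponds to the configuration error the requirement above is meant to prevent.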

AccessPoints can be used to provide different levels and types of access to the same collection for different users. For example, you can provide both Proxy and Archival URL mode access to a single collection by defining 2 AccessPoints with different Replay User Interfaces but the same WaybackCollection. Using AccessPoints, you can also provide different levels of access to a collection. For example, users within a particular subnet may be able to access all documents within a collection via one AccessPoint, but users outside that subnet may be restricted to viewing documents allowed by a web site's current robots.txt file.

Please refer to wayback.xml within the wayback .war file for detailed example AccessPoint configurations.

WaybackCollection Configuration

A WaybackCollection's configuration must specify the following implementations:

  • resourceStore - the specific implementation used to specify the set of documents within this collection, and how to access them for Replay requests.
  • resourceIndex - the specific implementation responsible for locating documents within the collection.

A WaybackCollection's configuration may optionally specify the following:

  • shutdownables - a List of one or more beans implementing org.archive.wayback.Shutdownable needed to maintain this WaybackCollection, typically Daemon Threads which perform automatic indexing operations on the resourceStore and the resourceIndex.

For more information on WaybackCollection configuration options and automatic indexing, please refer to the related documentation pages and to the example Spring .xml configuration files within the wayback .war.

Replay Modes

There are presently 3 Replay modes supported by the Wayback software: Archival URL mode, Proxy mode, and an experimental DomainPrefix mode.

Archival URL Replay Mode

Archival URL Replay mode uses a modified URL to designate documents stored in ARC/WARC files. The general form of an Archival URL is:

http://HOSTNAME:PORT/CONTEXT/TIMESTAMP/URL


where

  • HOSTNAME is the host where the Wayback software is running.
  • PORT is the port where Tomcat is listening for incoming HTTP requests; it also forms part of the name of the Access Point. See below for example CONTEXT mappings.
  • CONTEXT is an optional context where the Wayback webapp has been deployed, plus an optional name of the Access Point within the webapp. See below for example CONTEXT mappings.
  • TIMESTAMP is 0 to 14 digits of a date, possibly followed by an asterisk ('*'), or one or more tags providing further specifics for the request. The format of a TIMESTAMP is:
    YYYYMMDDHHmmss
    where
    • YYYY represents a 4-digit year
    • MM represents a 2-digit, 1-based month (Jan = 01 through Dec = 12)
    • DD represents a 2-digit day of the month (01-31)
    • HH represents a 2-digit hour (00-23)
    • mm represents a 2-digit minute (00-59)
    • ss represents a 2-digit second (00-59)
    The following are example dates expressed as 14-digit Timestamps:

    Jan 13, 1999 03:34:35 (am UTC) - 19990113033435


    Dec 31, 2004 23:01:00 (pm UTC) - 20041231230100


    Following the date portion of a timestamp, the following flags can be appended:

    • id_ Identity - perform no alterations of the original resource, return it as it was archived.
    • js_ Javascript - return document marked up as javascript.
    • cs_ CSS - return document marked up as CSS.
    • im_ Image - return document as an image.
  • URL represents the actual URL that should be replayed.
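
The TIMESTAMP and flag layout described above can be illustrated with a short sketch. It is not Wayback's request parser: the class ArchivalUrlSketch, its methods, and the regular expression are assumptions made for illustration, and the input is the part of the request path that follows CONTEXT.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; the names and the pattern are assumptions, not Wayback code.
class ArchivalUrlSketch {
    // 0 to 14 digits, then either "*" or zero or more two-letter flags ending in "_",
    // then "/" and the URL to replay.
    private static final Pattern FORM =
        Pattern.compile("^(\\d{0,14})(\\*|(?:[a-z]{2}_)*)/(.+)$");

    // Returns {timestamp, flags, url}, or null when the path does not fit the form.
    static String[] parse(String pathAfterContext) {
        Matcher m = FORM.matcher(pathAfterContext);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Formats date parts (UTC, 24-hour clock) as a 14-digit YYYYMMDDHHmmss timestamp.
    static String timestamp(int year, int month, int day, int hour, int minute, int second) {
        return String.format(Locale.ROOT, "%04d%02d%02d%02d%02d%02d",
            year, month, day, hour, minute, second);
    }

    public static void main(String[] args) {
        String[] parts = parse("19990113033435id_/http://www.example.org/index.html");
        System.out.println(String.join(" | ", parts));
        System.out.println(timestamp(2004, 12, 31, 23, 1, 0));
    }
}
```

Here timestamp(2004, 12, 31, 23, 1, 0) produces 20041231230100, the second example date shown above.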


For some simple and more elaborate examples of how AccessPoint bean names interact with Archival URLs, please refer to Access Point Naming.


Archival URL mode allows replay of all versions captured of a particular URL, by modifying the Timestamp. When an Archival URL Replay request is received for a URL, the Wayback Machine will replay the closest version in time to the Timestamp requested of the particular URL.
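
As a rough illustration of "closest version in time", the sketch below picks the nearest capture from a list of full 14-digit timestamps. It is a minimal sketch under stated assumptions, not the selection logic Wayback actually uses: the class ClosestCapture is hypothetical, partial timestamps are not handled, and a tie goes to whichever capture is listed first.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of nearest-capture selection; not Wayback's implementation.
class ClosestCapture {
    // Converts a full 14-digit timestamp (YYYYMMDDHHmmss, UTC) to seconds since the epoch.
    static long toEpochSeconds(String ts) {
        return LocalDateTime.of(
                Integer.parseInt(ts.substring(0, 4)),
                Integer.parseInt(ts.substring(4, 6)),
                Integer.parseInt(ts.substring(6, 8)),
                Integer.parseInt(ts.substring(8, 10)),
                Integer.parseInt(ts.substring(10, 12)),
                Integer.parseInt(ts.substring(12, 14)))
            .toEpochSecond(ZoneOffset.UTC);
    }

    // Returns the capture nearest in time to the requested timestamp, or null if there are none.
    static String closest(List<String> captures, String requested) {
        long want = toEpochSeconds(requested);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (String capture : captures) {
            long distance = Math.abs(toEpochSeconds(capture) - want);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = capture;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> captures =
            Arrays.asList("19990113033435", "20010615120000", "20041231230100");
        // A request for Jan 1, 2000 is nearer to the 1999 capture than to the 2001 one.
        System.out.println(closest(captures, "20000101000000"));
    }
}
```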


HTML documents returned in Archival URL Replay mode are modified from the original version to provide a replay experience more consistent with viewing the original content. This is accomplished by one of two methods. The first includes modification of a subset of the HTML tags on the server, combined with the insertion of JavaScript into the HTML page. This JavaScript executes in the client browser after the page has loaded, and modifies the remaining URLs within the HTML page, both Anchors (links) as well as embedded content (images, applets, etc.), so that they become appropriate Archival URL requests back to the Wayback application. The second method involves rewriting all HTML tags within the page on the server, to make embedded URLs point back into the Wayback application.


Currently, we recommend the entirely server-side rewriting method and are deprecating the original server-side plus JavaScript method, although that functionality is still available in Wayback. Neither method is perfect: not all URLs are rewritten correctly, particularly URLs that are created by JavaScript in the original pages, and links inside specialized file types such as Flash and PDF documents.


The properties parser and uriConverter for Archival URL Access Points must be set to the following implementations:


    <property name="parser">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser"
        init-method="init">
        <property name="maxRecords" value="1000" />
        <property name="earliestTimestamp" value="1996" />
      </bean>
    </property>

    <property name="uriConverter">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
        <property name="replayURIPrefix" value="http://wayback.example.org:8080/collection/" />
      </bean>
    </property>

          
configuration      optional/required  description
maxRecords         optional           Sets the default maximum requested records for Archival URL query requests.
earliestTimestamp  optional           Sets the default start date for requested records for Archival URL query requests.
replayURIPrefix    required           Points to the Archival URL prefix of the Access Point, as illustrated in the Access Point Path Configuration document.

For additional configuration examples and information about ArchivalUrl Replay mode, please see the file ArchivalUrlReplay.xml.

Proxy Replay Mode

Wayback can be configured to act as an HTTP proxy server. To utilize this mode, the wayback webapp must be deployed as the ROOT context, no other AccessPoints can use the port dedicated to the Proxy AccessPoint, and client browsers must be configured to proxy all HTTP requests through the Wayback Machine application. Instead of retrieving documents from the live web, the Wayback Machine will retrieve documents from the configured WaybackCollection.

Proxy Replay mode does not suffer from the shortcomings of the inserted JavaScript that the Archival URL mode uses: all URLs function as they did originally. There is another drawback to using this feature, however: no date information is sent with each request. Wayback attempts to address this problem by associating the date clicked on query pages, when a Replay session is begun, with the user's IP address. This can fail to work properly when multiple users are behind a NAT system that causes them to appear to have the same IP address.
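
The NAT limitation is easier to see with a toy model of the association. The sketch below is an assumption about the general idea only (a map keyed by client IP address), not Wayback's session handling; the class ProxySessionDates and its methods are hypothetical, and the addresses are documentation-range placeholders.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of remembering a replay date per client IP address; not Wayback's own code.
class ProxySessionDates {
    private final Map<String, String> dateByClientIp = new ConcurrentHashMap<>();

    // Called when a user clicks a dated capture on a query page.
    void rememberDate(String clientIp, String timestamp) {
        dateByClientIp.put(clientIp, timestamp);
    }

    // Called for each proxied request, which carries no date of its own.
    String dateFor(String clientIp, String fallbackTimestamp) {
        return dateByClientIp.getOrDefault(clientIp, fallbackTimestamp);
    }

    public static void main(String[] args) {
        ProxySessionDates sessions = new ProxySessionDates();
        // Two users behind the same NAT appear with one address, so the second
        // click replaces the date chosen by the first user.
        sessions.rememberDate("203.0.113.7", "19990113033435");
        sessions.rememberDate("203.0.113.7", "20041231230100");
        System.out.println(sessions.dateFor("203.0.113.7", "20000101000000"));
    }
}
```

Keying on anything that distinguishes browsers behind the same address, such as a per-browser HTTP header, would avoid the overwrite in this model.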

Additionally, there is an experimental Firefox-specific plugin developed by Oskar Grenholm, which provides a novel interface to navigate between different captured versions of a page within Proxy mode, and also sends a special HTTP header which allows Wayback to uniquely associate the correct date with browsers, even those behind a NAT system. You can find out more about this plugin and download it here.

Thanks Oskar!

The following is an example Proxy Replay Access Point definition. It assumes that Wayback is running on the host wayback.somehost.org, that a Tomcat Connector has been added for port 8090, that the Wayback webapp has been deployed at the ROOT context, and that another Archival URL Access Point named "8080:wayback" has been configured.


<bean name="8090" parent="8080:wayback">
  <property name="queryPrefix" value="http://wayback.somehost.org/" />
  <property name="replay" ref="proxyreplay" />
  <property name="uriConverter">
    <bean class="org.archive.wayback.proxy.RedirectResultURIConverter">
      <property name="redirectURI" value="http://wayback.somehost.org/jsp/Redirect.jsp" />
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.proxy.ProxyRequestParser" >
      <property name="localhostNames">
        <list>
          <value>wayback.somehost.org</value>
        </list>
      </property>
      <property name="maxRecords" value="1000" />
    </bean>
  </property>
</bean>

          

redirectURI is required, and must be set to the name of the host where the Wayback application is running. If this is not the primary name of the machine running the Wayback application, then you may need to also specify the hostname used for the Wayback application in the localhostNames configuration list.

For additional configuration examples and information about Proxy Replay mode, please see the file ProxyReplay.xml.

DomainPrefix Replay Mode

Wayback includes an additional, experimental Replay mode which is similar to Archival URL mode, in that any document can be referenced as a global URL, without any browser configuration requirements. This mode requires deploying the Wayback webapp in the ROOT context, and special DNS wildcard aliasing, so that all hostnames with a common suffix will be directed to your host running Wayback.

The general form of a DomainPrefix URL is:

http://TIMESTAMP.ARCHIVE-HOSTNAME.WAYBACK-HOSTNAME:PORT/ARCHIVE-PATH

Here is an example DomainPrefix URL, on an assumed host wayback.somehost.org, with a wayback webapp deployed as ROOT, via the Access Point named 8081 (which indicates the port Wayback requests will be received on) for the page http://www.yahoo.com/foo.gif on Dec 31, 1999 at 12:00:00 UTC.

http://19991231120000.www.yahoo.com.wayback.somehost.org:8081/foo.gif
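
The general form can be assembled and taken apart again with a few lines of string handling. This is a hedged sketch rather than Wayback's DomainPrefix implementation; the class DomainPrefixUrl and both method names are hypothetical.

```java
// Illustrative only: builds and splits DomainPrefix host names by plain string handling.
class DomainPrefixUrl {
    // Builds http://TIMESTAMP.ARCHIVE-HOSTNAME.WAYBACK-HOSTNAME:PORT/ARCHIVE-PATH
    static String build(String timestamp, String archiveHost, String archivePath,
                        String waybackHost, int port) {
        String path = archivePath.startsWith("/") ? archivePath : "/" + archivePath;
        return "http://" + timestamp + "." + archiveHost + "." + waybackHost + ":" + port + path;
    }

    // Splits a request host name into {timestamp, archiveHost}, or returns null
    // when the host does not end with the Wayback host suffix.
    static String[] split(String requestHost, String waybackHost) {
        String suffix = "." + waybackHost;
        if (!requestHost.endsWith(suffix)) {
            return null;
        }
        String prefix = requestHost.substring(0, requestHost.length() - suffix.length());
        int firstDot = prefix.indexOf('.');
        if (firstDot <= 0) {
            return null;
        }
        return new String[] { prefix.substring(0, firstDot), prefix.substring(firstDot + 1) };
    }

    public static void main(String[] args) {
        System.out.println(build("19991231120000", "www.yahoo.com", "/foo.gif",
            "wayback.somehost.org", 8081));
    }
}
```

The call in main reproduces the example URL shown above.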

This mode performs all URL rewriting on the server side, so needs no client-side Javascript to execute, and also does not suffer from some of the request leakage problems present in Archival URL mode. It presently is somewhat naive about rewriting links within returned documents, and will also rewrite URLs in the text of pages (not desired), as well as URLs referenced within the page (desired).

For additional configuration examples and information about Domain Prefix Replay mode, please see the files wayback.xml and DomainPrefixReplay.xml.

Wayback UI customization options

Wayback provides several opportunities for customizing the user interface presented to users, which can be grouped into 4 categories:

  • Query UI rendering .jsp files.
  • Replay insert .jsp files.
  • Exception rendering .jsp files.
  • Localization .properties files.

Query UI

All content returned by Wayback in response to Query requests is generated by .jsp files, which are executed and provided access to the results found within the ResourceIndex. Wayback is distributed with several sample implementations.

To alter the default behavior, you may either provide your own .jsp files, and configure the Renderer to use them instead of the default .jsp files, or the default .jsp files may be modified directly.

  • captureJsp - used when the request indicates that a listing of all dates available for a single URL should be returned. Default is /WEB-INF/query/HTMLCaptureResults.jsp. An alternate implementation, /WEB-INF/query/CalendarResults.jsp will generate HTML output similar to the global Wayback Machine service.
  • urlJsp - used when the request indicates that a summary of captures available for a number of URLs should be returned. Default is /WEB-INF/query/HTMLUrlResults.jsp.
  • xmlCaptureJsp - used when the request indicates that a listing of all dates available for a single URL should be returned in XML format. Default is /WEB-INF/query/XMLCaptureResults.jsp.
  • xmlUrlJsp - used when the request indicates that a summary of captures available for a number of URLs should be returned in XML format. Default is /WEB-INF/query/XMLUrlResults.jsp.

Replay Inserts

Wayback allows for embedding additional content within replayed HTML pages in all Replay modes. This is accomplished by executing one or more .jsp files with access to context information about the request, the results, and the actual Resource being returned. The output of each .jsp file is included within the returned page.

Wayback is distributed with several example .jsp insert files that can be used as is, modified to suit installation requirements, or used as examples for more elaborate customizations:

  • /WEB-INF/replay/ArchiveComment.jsp inserts an HTML comment indicating when the document was captured and retrieved.
  • /WEB-INF/replay/ClientSideJSInsert.jsp inserts some Javascript into the returned HTML page that updates links, images, and other embedded content, attempting to make all URL references within the page point back into the Wayback service.
  • /WEB-INF/replay/DebugBanner.jsp Not intended for production use, but a slightly more complex jsp insert example that demonstrates how to access various request context data, and is sometimes useful for debugging.
  • /WEB-INF/replay/Disclaimer.jsp Inserts a small banner at the top of replayed HTML pages, alerting users that they are viewing an archived page, and providing some information about the particular capture.
  • /WEB-INF/replay/JSLessTimeline.jsp Inserts a banner in the top of replayed documents which allows users to navigate directly between other captures of the current page they are viewing. This version does not use Javascript to place the banner, so it will appear in all HTML pages within a frameset.
  • /WEB-INF/replay/Timeline.jsp Inserts a banner in the top of replayed documents which allows users to navigate directly between other captures of the current page they are viewing. This version uses Javascript to place the banner, attempting to only place the banner in the largest frame within a frameset.
  • /WEB-INF/replay/Toolbar.jsp Inserts a fancier banner in the top of replayed documents which includes a graphic representation of the number of captures over time and allows users to navigate directly between other captures of the current page they are viewing. This version uses Javascript to place the banner, attempting to only place the banner in the largest frame within a frameset.

Exception Rendering

Wayback is distributed with a default ExceptionRenderer that allows customization of several types of anticipated exceptions that can occur through normal operations. The BaseExceptionRenderer allows installations to provide alternate .jsp files which are executed, and the output of these .jsp files is returned to end users. To alter the default behavior, you may either provide your own .jsp files and configure the BaseExceptionRenderer to use them instead of the default .jsp files, or modify the default .jsp files directly.

  • xmlErrorJsp - used when the request indicates that XML data should be returned. Default is /WEB-INF/exception/XMLError.jsp.
  • errorJsp - used for HTML Replay exceptions, and for all Query exceptions. Default is /WEB-INF/exception/HTMLError.jsp.
  • imageErrorJsp - used when the request appears to be an embedded Replay request that expects an image to be returned. Default is /WEB-INF/exception/HTMLError.jsp, which produces HTML output. This may be desirable over returning an actual image, since web browsers will usually show any HTML alternate text associated with the image in place of the image when image data is not returned. Wayback also includes a 1x1 pixel gif, error_image.gif, which can be used to display a gray box in place of image requests that result in an exception.
  • javascriptErrorJsp - used when the request appears to be an embedded Replay request that expects Javascript content to be returned. Default is /WEB-INF/exception/JavaScriptError.jsp.
  • cssErrorJsp - used when the request appears to be an embedded Replay request that expects CSS content to be returned. Default is /WEB-INF/exception/CSSError.jsp.

Localization .properties files

Wayback is packaged with a set of reference implementation .jsp files for generating Query, Replay, and Exception user interface pages. User-visible text is abstracted out of these .jsp files, so the specific text to display on each page is read from a .properties file. Wayback will automatically search for a Locale-specific .properties file from which these text values should be loaded, allowing the language presented to users to be changed.

By default, Wayback will use the language preference indicated by the user's web browser to find an appropriate .properties file, defaulting to the standard English text if the user's preferred language is not available. Particular AccessPoints can be forced to a particular Locale using the AccessPoint.locale property.

Several language customization .properties files have already been contributed by users in the community and are now included with the standard Wayback distribution. We plan for a completely new and improved UI implementation for version 1.6, and plan a more active outreach program to create customizations in as many languages as possible once this new UI is completed, and the required text elements are determined.

Excluding Documents within an AccessPoint

Excluding Documents with live Robots.txt

Documents may be excluded from access within an Access Point by retroactively enforcing the policies in a web site's live robots.txt document. To do so, add the following configuration to the Access Point:

<property name="exclusionFactory" ref="excluder-factory-robot" />

        


Please see the default wayback.xml packaged with this software for an example bean definition for the referenced excluder-factory-robot bean.

Excluding Documents with an Administrative List

Documents may be excluded from access within an Access Point by using a plain text file listing URL prefixes which should be blocked. If this option is used with a non-zero value for checkInterval, the Wayback software will monitor the external file, and will automatically reload the file when it changes.

The following Spring configuration defines a static exclusion file that causes URLs listed in the file /tmp/exclude.txt to be blocked, with the file being checked for updates every 10 minutes.

<bean id="static-exclusion" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory" init-method="init">
  <property name="file" value="/tmp/exclude.txt" />
  <property name="checkInterval" value="600" />
</bean>

        


Adding the following configuration to an Access Point will cause the excluded URLs named in /tmp/exclude.txt to be inaccessible:

<property name="exclusionFactory" ref="static-exclusion" />

        

Restricting who can interact with an AccessPoint

Limiting Access based on IP Addresses

Access to a particular Access Point can be limited to a specific IP address range by adding the following configuration to an Access Point definition.

<property name="authentication">
  <bean class="org.archive.wayback.authenticationcontrol.IPMatchesBooleanOperator">
    <property name="allowedRanges">
      <list>
        <value>192.168.1.16/24</value>
      </list>
    </property>
  </bean>
</property>

        
which would have the effect of blocking users outside the 192.168.1.16/24 network.
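
Under standard CIDR semantics, a /24 range compares only the first 24 bits of each address, so 192.168.1.16/24 covers 192.168.1.0 through 192.168.1.255. The sketch below shows that comparison for IPv4; it is an illustration of the arithmetic, not the code behind IPMatchesBooleanOperator, and the class CidrRange is a hypothetical name.

```java
// Illustration of IPv4 CIDR matching; not the implementation used by Wayback.
class CidrRange {
    // Packs a dotted-quad IPv4 address such as "192.168.1.16" into a 32-bit int.
    static int toInt(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True when the address shares the range's leading prefix bits.
    static boolean contains(String cidr, String ipv4) {
        String[] parts = cidr.split("/");
        int prefixBits = Integer.parseInt(parts[1]);
        // A shift by 32 is a no-op for Java ints, so /0 needs an explicit empty mask.
        int mask = prefixBits == 0 ? 0 : -1 << (32 - prefixBits);
        return (toInt(parts[0]) & mask) == (toInt(ipv4) & mask);
    }

    public static void main(String[] args) {
        System.out.println(contains("192.168.1.16/24", "192.168.1.200"));
        System.out.println(contains("192.168.1.16/24", "192.168.2.1"));
    }
}
```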

Limiting Access based on HTTP BASIC Authentication

Access to a particular Access Point can be restricted using Tomcat's built-in configuration options. Add the following configuration to web.xml. This example assumes an Access Point named "8080:usersecure", although the same pattern applies to an Access Point on any port:

<security-role>
  <description>Secured-Wayback</description>
  <role-name>wayback</role-name>
</security-role>

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Secured-Wayback</web-resource-name>
    <url-pattern>/usersecure/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>wayback</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Secured-Wayback</realm-name>
</login-config>

        




Then add the user configuration to the tomcat-users.xml file:

<role rolename="wayback"/>
<user password="changeM3" roles="wayback" name="brad"/>

        

Adding Additional Configurations to an AccessPoint

The following configuration can be added to an Access Point:


<property name="configs">
  <props>
    <prop key="inst">Acrobatic Association</prop>
    <prop key="logo">http://images.somehost.com/logos/acro.jpg</prop>
  </props>
</property>

        

These configurations are then accessible in the common .jsp rendering pages, allowing Collection or Access Point specific text to be relayed to shared .jsp files, which can then retrieve the Access Point specific configuration with the following code:


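// Look up the UI context for this request, then read the Access Point specific values.
UIResults results = UIResults.getGeneric(request);
String instString = results.getContextConfig("inst"); // value of the "inst" key
String logoString = results.getContextConfig("logo"); // value of the "logo" key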

...

        

External Tools

The wayback distribution includes several command-line tools that assist in creating and testing index files, and populating the ArcProxy location db.

All the command line tools can be found underneath the directory where you unpacked your distribution, at bin/* (example: bin/location-client).

bdb-client

This tool allows several maintenance operations to be performed on BDB files. There are two primary modes, read and write.

  1. bin/bdb-client -r BDB_DIR BDB_NAME [PREFIX]
    Output records from a BDB database on STDOUT.
    where:
    • BDB_DIR Open BDB in this directory.
    • BDB_NAME Open BDB with this name.
    • PREFIX (optional) if present, only output records whose KEY begins with PREFIX. If this option is omitted, all records will be output from the BDB. Records are always output in sorted order.
  2. bin/bdb-client -w BDB_DIR BDB_NAME
    Read CDX format lines from STDIN, and insert into a BDB, creating the BDB if needed.
    where:
    • BDB_DIR Open BDB in this directory.
    • BDB_NAME Open BDB with this name.

bin-search

This tool allows binary searching against large sorted text files. It will output lines prefixed with a particular key on STDOUT.

bin/bin-search KEY FILE [FILE2 ...]

  • KEY String prefix for lines that should be output.
  • FILE [FILE2 ...] Search through all files specified, outputting the lines prefixed with KEY from each file in a single, sorted stream. This assumes that all FILE arguments are sorted.
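
The idea behind a binary search for a key prefix can be sketched on an in-memory list. The real tool works on large files on disk, so this is only an illustration of the technique under that simplification; the class PrefixSearch is hypothetical, and the list is assumed to be sorted in String.compareTo order.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory illustration of prefix lookup by binary search; not the bin-search tool itself.
class PrefixSearch {
    // Returns every line that starts with key. The input must already be sorted.
    static List<String> search(List<String> sortedLines, String key) {
        int lo = 0;
        int hi = sortedLines.size();
        // Find the first line that is not less than key.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedLines.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Matching lines are contiguous in sorted order, so scan forward until the prefix ends.
        List<String> matches = new ArrayList<>();
        for (int i = lo; i < sortedLines.size() && sortedLines.get(i).startsWith(key); i++) {
            matches.add(sortedLines.get(i));
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("example.com/ 19990113033435");
        lines.add("example.com/a 20010615120000");
        lines.add("example.org/ 20041231230100");
        System.out.println(search(lines, "example.com/"));
    }
}
```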

cdx-indexer

This tool creates a CDX format index for the ARC/WARC file at PATH, either on STDOUT or at the path specified by CDX_PATH. The resulting file can be sorted and merged with other CDX format index files to generate a CDX format ResourceIndex.

bin/cdx-indexer [-identity] PATH [CDX_PATH]
          

Note that when manually constructing CDX files using this tool, you must set the environment variable LC_ALL=C when using the standard UNIX sort command line tool.
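
One plausible reading of the LC_ALL=C requirement is that index lookups expect lines in plain byte order rather than in a locale's collation order. The sketch below contrasts the two orderings for ASCII text. It is an illustration only: the class SortOrderDemo is hypothetical, and Java's Collator stands in for locale-aware sorting without claiming to match any particular sort implementation.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Contrasts byte-order sorting with locale-aware collation; illustrative only.
class SortOrderDemo {
    // Code-unit order, which for ASCII text matches the byte order used under LC_ALL=C.
    static List<String> byteOrder(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    // Locale-aware order, where letter case does not decide the primary ordering.
    static List<String> localeOrder(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        copy.sort(Collator.getInstance(Locale.US));
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "B");
        System.out.println(byteOrder(lines));   // uppercase letters have lower byte values
        System.out.println(localeOrder(lines)); // alphabetical regardless of case
    }
}
```

A file sorted in the second order would not be in the first order, which is why a sorted-input lookup could miss records in it.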

The -identity option causes the tool to skip canonicalization of URLs. When using this option, you will need to pass the CDX records through the url-client tool before sorting them into a production CDX index. See the documentation for the url-client tool, and the URL Canonicalization section for more information.

location-client

If you have already populated your ResourceIndex and just need to inform the ArcProxy LocationDB of where ARC files are located, this script will allow you to synchronize the ArcProxy LocationDB with the directories holding your ARC files.

Execute the script once for each directory containing ARC files (on each machine containing ARC files.) Again, this script will not index the content of the ARC files, but will only populate the ArcProxy LocationDB with the locations of ARC files.

bin/location-client sync LOCATION_URL ARC_DIR ARC_URL_PREFIX

where:

  • LOCATION_URL is the absolute URL where the FileProxy can be accessed. ex. http://wayback-webapp.your-archive.org:8080/locationdb/locationDB
  • ARC_DIR is the absolute path to the directory on the local machine which holds ARC files ex. /2/arc-collection-1
  • ARC_URL_PREFIX is the absolute URL where the directory ARC_DIR can be accessed. ex. http://arc-storage-node-1.your-archive.org/2/arc-collection-1/

url-client

URLs stored in BDB and CDX format ResourceIndexes are canonicalized to a more generic form. Before performing a lookup operation on the ResourceIndex, the same canonicalization function is applied to requested URLs. This tool will read space (" ") delimited lines from STDIN and output the same lines on STDOUT, but with one column altered. The column that is changed is assumed to be a URL, and the version output is the canonicalized form of the input URL.

This tool is required when using the cdx-indexer tool with the -identity option. Typical usage involves generating an identity CDX index, then passing the lines in that index through this tool to canonicalize the record URL key for queries. If the identity CDX files are kept, then canonicalization schemes can be swapped without reindexing the original ARC/WARC content. This tool can also be useful for debugging the canonicalization function. See the section URL Canonicalization for more information.

bin/url-client [-cdx] [-d DELIMITER] [-f FIELD] [-f FIELD2] ...

  • -cdx Pass thru lines prefixed with " CDX " unchanged.
  • -d DELIMITER Use DELIMITER to separate fields instead of the default space (" ").
  • -f FIELD alter column FIELD of each line, instead of the default column 1. If specified multiple times, then each column will be canonicalized in transformed lines.
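
The column rewriting that the options describe can be sketched as follows. The canonicalizer here is a deliberately simple stand-in (lower-casing and stripping a leading "http://" and "www."), NOT Wayback's canonicalization function, and the class ColumnCanonicalizer and its methods are hypothetical names.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Sketch of rewriting one column of delimited lines; the canonicalizer is a toy stand-in.
class ColumnCanonicalizer {
    // Placeholder canonicalization, for illustration only.
    static String toyCanonicalize(String url) {
        String s = url.toLowerCase(Locale.ROOT);
        if (s.startsWith("http://")) {
            s = s.substring("http://".length());
        }
        if (s.startsWith("www.")) {
            s = s.substring("www.".length());
        }
        return s;
    }

    // Rewrites the 1-based column 'field' of a delimited line. When passCdxHeader is
    // true, lines prefixed with " CDX " are returned unchanged.
    static String transform(String line, String delimiter, int field,
                            boolean passCdxHeader, UnaryOperator<String> canonicalizer) {
        if (passCdxHeader && line.startsWith(" CDX ")) {
            return line;
        }
        String[] columns = line.split(Pattern.quote(delimiter), -1);
        if (field >= 1 && field <= columns.length) {
            columns[field - 1] = canonicalizer.apply(columns[field - 1]);
        }
        return String.join(delimiter, columns);
    }

    public static void main(String[] args) {
        String line = "http://WWW.Example.org/Page 19990113033435";
        System.out.println(transform(line, " ", 1, true, ColumnCanonicalizer::toyCanonicalize));
    }
}
```

Because the canonicalizer is passed in as a function, a different scheme can be applied to the same identity lines without touching the rest of the code, which mirrors the point made above about keeping identity CDX files.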

FileProxy and LocationDB application

The Wayback software includes an additional application, the FileProxy, which can simplify some distributed ResourceStore implementations. The FileProxy application exposes two external services, one used to configure the underlying database mapping ARC/WARC filenames to the actual, fully qualified HTTP 1.1 URL or local path, and a second service which reverse proxies incoming HTTP 1.1 range requests to appropriate back-end storage nodes.

The fileproxy reverse proxy service allows one or more SimpleResourceStore instances to configure a single URL prefix where all ARC/WARC files are assumed to be located. This reverse proxy then uses a BDB JE to find the actual current location of the ARC/WARC file, and forward the request to the actual host holding the ARC/WARC file.

The locationdb service allows population and management of the BDB JE database (the locationDB) used by the fileproxy service. There is also a command line tool, location-client, described elsewhere in this document, which provides command line access to the management of the locationDB.

Adding the following configuration to wayback.xml will expose the fileproxy and locationdb services:


<bean id="filelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB"
  init-method="init">
  <property name="bdbPath" value="/tmp/wayback/file-db/db/" />
  <property name="bdbName" value="DB1" />
  <property name="logPath" value="/tmp/wayback/file-db/db.log" />
</bean>

<bean name="8080:fileproxy" class="org.archive.wayback.resourcestore.locationdb.FileProxyServlet">
  <property name="locationDB" ref="filelocationdb" />
</bean>

<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.locationdb.ResourceFileLocationDBServlet">
  <property name="locationDB" ref="filelocationdb" />
</bean>