The Location Database provides a mapping between ARC/WARC file names and the absolute locations of those ARC/WARC files. An absolute location, in this case, is either an HTTP URL or an absolute path to a file on the local file system.
Whenever a location is added for a file name that was not previously present in the location database, a record (in this case a line) is appended to a log file. This log file can then be used to determine which files have been seen by the location database. The ResourceFileLocationDatabase interface includes methods to retrieve the current length of this log file, and to return an iterator over all records between two points in the log. An observer can therefore poll the location database, remember the last log position it has read, and generate events when new files are added to the underlying database.
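The logging behaviour described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the class LocationLogSketch and its methods are hypothetical and are not the ResourceFileLocationDatabase API, and the log position is modelled as a record count rather than a byte length of a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model, NOT the real ResourceFileLocationDatabase API.
class LocationLogSketch {
    // File name -> known absolute locations (HTTP URLs or local paths).
    private final Map<String, Set<String>> locations = new HashMap<>();
    // Append-only log: one record per file name, written the first time it is seen.
    private final List<String> log = new ArrayList<>();

    void addLocation(String name, String location) {
        Set<String> known = locations.get(name);
        if (known == null) {
            known = new LinkedHashSet<>();
            locations.put(name, known);
            log.add(name); // only a previously unseen name produces a log record
        }
        known.add(location);
    }

    List<String> locationsOf(String name) {
        Set<String> known = locations.get(name);
        return known == null ? new ArrayList<>() : new ArrayList<>(known);
    }

    // Current end of the log (assumption: a record count stands in for the file length).
    int logEnd() {
        return log.size();
    }

    // Records in the half-open range [start, end).
    List<String> namesBetween(int start, int end) {
        return new ArrayList<>(log.subList(start, end));
    }
}
```

An observer using this model would keep the value of logEnd() from its previous poll, call namesBetween(previous, logEnd()) on the next poll, and treat each returned name as a newly added file.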
Wayback includes 5 Thread/Worker classes to enable automatic indexing of new content:
Wayback allows for several configurations enabling diverse collection sizes and distribution of ARC/WARC files across many local directories or across many servers. For most configurations, the default LocationDBResourceStore will suffice, but Wayback is distributed with 2 additional classes, FileProxy and SimpleResourceStore, which provide an opportunity to insert a single HTTP caching server between the Wayback service and an ARC/WARC storage cluster.
This implementation uses a LocationDB to convert ARC/WARC file names into absolute paths or HTTP URLs. The underlying LocationDB can be managed by the automatic indexing threads as described above, or it can be managed manually with the location-client command-line tool. Be sure to enable the org.archive.wayback.resourcestore.locationdb.FileProxyServlet if you plan to manage the LocationDB manually.
This configuration depends on all ARC/WARC files appearing within a single root directory exported over HTTP 1.1, or within a single local directory. Each ARC/WARC file name is appended to a common prefix, which is either a local directory on the host running Wayback or a single remote directory.
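As a minimal sketch of the prefix scheme, the following shows how a file name might be joined to a common prefix. The class PrefixResolverSketch is hypothetical and is not the SimpleResourceStore implementation; in particular, the trailing-slash normalisation and the rejection of names containing a slash are assumptions made for the example.

```java
// Hypothetical helper, NOT the SimpleResourceStore implementation.
class PrefixResolverSketch {
    private final String prefix;

    PrefixResolverSketch(String prefix) {
        // Assumption: normalise the prefix so it always ends with exactly one separator.
        this.prefix = prefix.endsWith("/") ? prefix : prefix + "/";
    }

    // Works the same way for a local directory and for an HTTP URL prefix.
    String resolve(String fileName) {
        if (fileName.isEmpty() || fileName.contains("/")) {
            // All files live directly under one directory, so no nested paths are expected.
            throw new IllegalArgumentException("expected a bare file name: " + fileName);
        }
        return prefix + fileName;
    }
}
```

With a prefix of http://archive.example.org/arcs (a made-up host), resolve("x.warc.gz") yields http://archive.example.org/arcs/x.warc.gz; with a prefix of /var/wayback/arcs/ it yields /var/wayback/arcs/x.warc.gz.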
The FileProxyServlet can be used to make all ARC/WARC files accessible within a single HTTP directory, acting as a reverse proxy to the actual host holding the ARC/WARC files. The FileProxyServlet uses a LocationDB to translate requested ARC/WARC filenames into the actual location of each file.
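The translation step of such a proxy can be illustrated roughly as follows. This sketch covers only the lookup from a requested path to a backend location; it does not stream any bytes, and the class FileProxySketch, its method name, and the use of a plain Map in place of a LocationDB are all assumptions rather than the FileProxyServlet code.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup step only, NOT the FileProxyServlet implementation.
class FileProxySketch {
    // Assumption: a plain map stands in for the LocationDB (file name -> actual location).
    private final Map<String, String> locationDb;

    FileProxySketch(Map<String, String> locationDb) {
        this.locationDb = locationDb;
    }

    // Maps a request path such as "/arcs/a.warc.gz" to the location of the real file.
    // An empty result means the proxy has nothing to serve (for example, a 404 response).
    Optional<String> backendFor(String requestPath) {
        // The file name is whatever follows the last slash in the requested path.
        String name = requestPath.substring(requestPath.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(locationDb.get(name));
    }
}
```

A full proxy would then open the returned location and relay the requested bytes to the client, so that every file appears to live under the proxy's single directory regardless of which host actually stores it.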
When using the automatic indexing functionality, you need to provide a list of ResourceFileSource objects to the ResourceFileSourceUpdater class. Wayback currently contains 2 ResourceFileSource implementations: