- All Implemented Interfaces:
- ObjectFilter<CaptureSearchResult>
public class ConditionalGetAnnotationFilter
extends Object
implements ObjectFilter<CaptureSearchResult>
The WARC file format allows two forms of deduplication. The first downloads
each document and compares its digest with a database of previous values. When
a new capture of a document exactly matches the previous digest, an
abbreviated record is stored in the WARC file. The second form uses an HTTP
conditional GET request, sending the values previously returned for a given
URL (ETag, Last-Modified, etc.). In this case, the remote server either sends
a new document (200), which is stored normally, or returns a
304 (Not Modified) response, which is stored in the WARC file.
For the first record type, the wayback indexer outputs a placeholder
record that includes the digest of the last-stored record. For 304 responses,
the indexer outputs a normal-looking record, but because a 304 response
carries no body, the record has a SHA1 digest that is easily recognizable as
that of an "empty" document. That SHA1 is always:
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
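The value above should be the Base32-encoded SHA-1 of zero bytes of content. The following is a minimal sketch for checking that, not code taken from wayback: the class name EmptyDigestDemo and its helper methods are hypothetical, and the Base32 encoder is hand-written (unpadded, RFC 4648 alphabet) because the Java standard library does not ship one.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class EmptyDigestDemo {
    // RFC 4648 Base32 alphabet (assumed to be the encoding used for the digest above).
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    /** Encodes bytes as unpadded Base32, emitting one character per 5 bits. */
    static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // only the low `bits` bits are meaningful
        int bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(ALPHABET.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
        }
        if (bits > 0) {
            // Left-align any leftover bits in a final 5-bit group.
            out.append(ALPHABET.charAt((buffer << (5 - bits)) & 0x1F));
        }
        return out.toString();
    }

    /** Returns the Base32-encoded SHA-1 digest of the given content. */
    static String base32Sha1(byte[] content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            return base32(sha1.digest(content));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String digest = base32Sha1(new byte[0]);
        System.out.println(digest);
        System.out.println("3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ".equals(digest));
    }
}
```

A SHA-1 digest is 160 bits, which divides evenly into 32 five-bit groups, so the encoded form needs no padding.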
This class observes a stream of SearchResults, remembering the values of
the last record seen whose SHA1 field is not the empty-document digest. Any
subsequent SearchResults with the empty SHA1 are annotated by copying the
values from that last non-empty record.
This is highly experimental.
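The annotation pass described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: Capture is a hypothetical stand-in for CaptureSearchResult, the copied fields (digest and MIME type) are chosen only for the example, and leaving an empty record untouched when no non-empty record has been seen yet is an assumption the description does not confirm.

```java
import java.util.Arrays;
import java.util.List;

class AnnotationSketch {
    static final String EMPTY_SHA1 = "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ";

    /** Hypothetical capture record; not the wayback CaptureSearchResult API. */
    static final class Capture {
        final String timestamp;
        String digest;
        String mimeType;
        boolean annotated; // true once values were copied from an earlier record

        Capture(String timestamp, String digest, String mimeType) {
            this.timestamp = timestamp;
            this.digest = digest;
            this.mimeType = mimeType;
        }
    }

    /** Walks the stream in order, annotating empty-digest records in place. */
    static void annotate(List<Capture> stream) {
        Capture lastNonEmpty = null;
        for (Capture c : stream) {
            if (!EMPTY_SHA1.equals(c.digest)) {
                // Remember the most recent record that carries real content.
                lastNonEmpty = c;
            } else if (lastNonEmpty != null) {
                // Copy values from the last non-empty record onto the 304 record.
                c.digest = lastNonEmpty.digest;
                c.mimeType = lastNonEmpty.mimeType;
                c.annotated = true;
            }
            // Assumption: an empty record with no predecessor is left unchanged.
        }
    }

    public static void main(String[] args) {
        List<Capture> stream = Arrays.asList(
                new Capture("20100101000000", "EXAMPLEDIGESTAAAA", "text/html"),
                new Capture("20100201000000", EMPTY_SHA1, "unk"),
                new Capture("20100301000000", EMPTY_SHA1, "unk"));
        annotate(stream);
        for (Capture c : stream) {
            System.out.println(c.timestamp + " " + c.digest + " " + c.mimeType
                    + (c.annotated ? " (annotated)" : ""));
        }
    }
}
```

The sketch keeps a single reference to the last non-empty record, so it depends on the records for a given URL arriving in capture order.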
- Version:
- $Date$, $Revision$
- Author:
- brad