public class CDXToCaptureSearchResultsWriter extends CDXToSearchResultWriter
CDXToSearchResultWriter for producing CaptureSearchResults. Also resolves revisits and sets closest.
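The digestToOriginal and digestToRevisits fields listed below suggest how revisit resolution could work: a revisit is linked to an earlier capture with the same digest, or parked until such a capture shows up. The following is only a minimal sketch under that assumption, not the class's actual implementation. Capture and its fields are hypothetical stand-ins for CaptureSearchResult, and the digest strings are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class RevisitResolutionSketch {
    /** Hypothetical stand-in for CaptureSearchResult; not the Wayback class. */
    static final class Capture {
        final String timestamp;
        final String digest;
        final boolean revisit;
        Capture original; // set once a capture holding the payload is found

        Capture(String timestamp, String digest, boolean revisit) {
            this.timestamp = timestamp;
            this.digest = digest;
            this.revisit = revisit;
        }
    }

    /** Links each revisit to the first non-revisit capture sharing its digest. */
    static void resolve(List<Capture> captures) {
        Map<String, Capture> digestToOriginal = new HashMap<>();
        Map<String, LinkedList<Capture>> digestToRevisits = new HashMap<>();
        for (Capture c : captures) {
            if (c.revisit) {
                Capture known = digestToOriginal.get(c.digest);
                if (known != null) {
                    c.original = known;
                } else {
                    // Original not seen yet: park the revisit under its digest.
                    digestToRevisits.computeIfAbsent(c.digest, k -> new LinkedList<>()).add(c);
                }
            } else if (!digestToOriginal.containsKey(c.digest)) {
                digestToOriginal.put(c.digest, c);
                // Resolve any revisits that were waiting for this digest.
                LinkedList<Capture> pending = digestToRevisits.remove(c.digest);
                if (pending != null) {
                    for (Capture r : pending) {
                        r.original = c;
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        Capture early = new Capture("20150101000000", "digest-A", true);
        Capture payload = new Capture("20150102000000", "digest-A", false);
        resolve(Arrays.asList(early, payload));
        System.out.println(early.timestamp + " resolved to " + early.original.timestamp);
    }
}
```

A revisit whose digest never matches a non-revisit capture simply keeps a null original in this sketch; how the real class treats such records is not stated here.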
Modifier and Type | Field and Description |
---|---|
protected CaptureSearchResult | closest |
protected HashMap<String,CaptureSearchResult> | digestToOriginal |
protected HashMap<String,LinkedList<CaptureSearchResult>> | digestToRevisits |
protected boolean | done |
protected ExclusionFilter | exclusionFilter |
protected int | flip |
protected boolean | includeBlockedCaptures |
protected boolean | isReverse |
protected String | preferContains |
protected org.archive.format.cdx.CDXLine | prevLine |
protected CaptureSearchResult | prevResult |
protected boolean | resolveRevisits |
protected CaptureSearchResults | results |
static String | REVISIT_VALUE |
protected boolean | seekSingleCapture |
protected SelfRedirectFilter | selfRedirFilter |
protected String | targetTimestamp |
msg, query
Constructor and Description |
---|
CDXToCaptureSearchResultsWriter(CDXQuery query, boolean resolveRevisits, boolean seekSingleCapture, String preferContains): Initialize with CDXQuery and other options. |
Modifier and Type | Method and Description |
---|---|
void | begin(): This method will be called just before looping over the sequence of CDX lines. |
protected CaptureSearchResult | determineClosest(CaptureSearchResult nextResult) |
void | end(): Called at the end. |
CaptureSearchResult | getClosest() |
ExclusionFilter | getExclusionFilter(): Deprecated. |
protected CaptureSearchResult | getLastAdded() |
CaptureSearchResults | getSearchResults() |
SelfRedirectFilter | getSelfRedirFilter() |
boolean | isAborted() |
boolean | isIncludeBlockedCaptures() |
void | setExclusionFilter(ExclusionFilter exclusionFilter): Deprecated. 2014-11-10 Use new implementation AccessPoint.setExclusionFactory(org.archive.wayback.accesscontrol.ExclusionFilterFactory) |
void | setIncludeBlockedCaptures(boolean includeBlockedCaptures): Set to true if blocked captures are to be included in the result. |
void | setSelfRedirFilter(SelfRedirectFilter selfRedirFilter) |
void | setTargetTimestamp(String timestamp) |
int | writeLine(org.archive.format.cdx.CDXLine line): Process line. |
getErrorMsg, getQuery, modifyOutputFormat, printError, writeResumeKey
close, printNumPages, serverError, setContentType, setMaxLines, trackLine, writeMiscLine
public static final String REVISIT_VALUE
protected CaptureSearchResults results
protected String targetTimestamp
protected int flip
protected boolean done
protected CaptureSearchResult closest
protected SelfRedirectFilter selfRedirFilter
protected ExclusionFilter exclusionFilter
protected CaptureSearchResult prevResult
protected org.archive.format.cdx.CDXLine prevLine
protected HashMap<String,CaptureSearchResult> digestToOriginal
protected HashMap<String,LinkedList<CaptureSearchResult>> digestToRevisits
protected boolean resolveRevisits
protected boolean seekSingleCapture
protected boolean isReverse
protected String preferContains
protected boolean includeBlockedCaptures
public CDXToCaptureSearchResultsWriter(CDXQuery query, boolean resolveRevisits, boolean seekSingleCapture, String preferContains)
This class generates CaptureSearchResult in chronological order, even when CDXQuery.isReverse() is true.
Note: the preferContains parameter is specifically intended for choosing one of two copies of an identical capture record held in different storage locations. For example, if WARCs in a staging area are made available for replay through a secondary index, there may be a period during which one capture is indexed in both the main and the secondary index, with a different filename field in each. If preferContains is set, a CDX line whose filename contains preferContains as a substring will be picked over others that do not. It can be used, for example, to give higher preference to the archive in the primary storage area.
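The preference rule described in the note can be sketched as follows. This is an illustration only, not the class's code: Line is a hypothetical record, duplicates are detected by timestamp and URL alone (the real comparison also considers length and offset), and keeping the first copy when preferContains is null is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PreferContainsSketch {
    /** Hypothetical CDX-like record; field names are assumptions. */
    static final class Line {
        final String timestamp;
        final String url;
        final String filename;

        Line(String timestamp, String url, String filename) {
            this.timestamp = timestamp;
            this.url = url;
            this.filename = filename;
        }

        boolean sameCapture(Line other) {
            return timestamp.equals(other.timestamp) && url.equals(other.url);
        }
    }

    /** Collapses adjacent duplicates, preferring a filename that contains preferContains. */
    static List<Line> dedupe(List<Line> lines, String preferContains) {
        List<Line> out = new ArrayList<>();
        for (Line line : lines) {
            Line prev = out.isEmpty() ? null : out.get(out.size() - 1);
            if (prev != null && prev.sameCapture(line)) {
                boolean prevPreferred = preferContains != null && prev.filename.contains(preferContains);
                boolean linePreferred = preferContains != null && line.filename.contains(preferContains);
                if (linePreferred && !prevPreferred) {
                    out.set(out.size() - 1, line); // swap in the preferred copy
                }
                continue; // either way, only one copy survives
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
                new Line("20150101000000", "http://example.com/", "staging/a.warc.gz"),
                new Line("20150101000000", "http://example.com/", "primary/a.warc.gz"));
        System.out.println(dedupe(lines, "primary").get(0).filename);
    }
}
```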
Parameters:
query - CDXQuery
resolveRevisits - Whether to resolve revisit captures.
seekSingleCapture - Whether just one capture is wanted. (Only effective when resolveRevisits is also true.)
preferContains - Preferred archive filename substring. If non-null, it picks the capture in the archive with the given substring in its filename, out of multiple captures of the same timestamp, original URL, length and offset (if any).
public void setTargetTimestamp(String timestamp)
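public void begin()
Description copied from interface: BaseProcessor
This method will be called just before looping over the sequence of CDX lines.
begin() on nested processor.
Specified by: begin in interface BaseProcessor
Overrides: begin in class CDXToSearchResultWriter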
public int writeLine(org.archive.format.cdx.CDXLine line)
Description copied from interface: BaseProcessor
Process line.
Parameters:
line - CDXLine
Returns:
line is sent to output, 0 otherwise.
protected CaptureSearchResult determineClosest(CaptureSearchResult nextResult)
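This page does not say how closeness to targetTimestamp is measured. One plausible reading, sketched below purely as an assumption, is the smallest absolute time difference between a capture and the target. The sketch works on plain 14-digit yyyyMMddHHmmss strings instead of CaptureSearchResult objects, and it lets the earliest-seen candidate win ties; neither choice is taken from the Wayback source.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

class ClosestSketch {
    // 14-digit capture timestamp layout assumed by this sketch.
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Returns the timestamp nearest to target, or null if the list is empty. */
    static String closest(List<String> timestamps, String target) {
        LocalDateTime wanted = LocalDateTime.parse(target, TS);
        String best = null;
        Duration bestGap = null;
        for (String ts : timestamps) {
            Duration gap = Duration.between(LocalDateTime.parse(ts, TS), wanted).abs();
            // Strict comparison: on a tie the earlier-seen candidate is kept.
            if (bestGap == null || gap.compareTo(bestGap) < 0) {
                best = ts;
                bestGap = gap;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> captures = Arrays.asList("20140101000000", "20150101000000", "20160101000000");
        System.out.println(closest(captures, "20150301000000"));
    }
}
```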
public void end()
Description copied from interface: BaseProcessor
Called at the end.
end() on nested processor.
Specified by: end in interface BaseProcessor
Overrides: end in class CDXToSearchResultWriter
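The begin(), writeLine() and end() descriptions above imply a simple calling sequence: begin() once before the loop, writeLine() once per CDX line, end() once afterwards. The sketch below shows that sequence with a hypothetical LineProcessor interface that takes plain strings. It is a simplified analogue of BaseProcessor, not the real interface, and returning 1 for an accepted line is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

class ProcessorLifecycleSketch {
    /** Hypothetical analogue of BaseProcessor; the real one takes CDXLine. */
    interface LineProcessor {
        void begin();
        int writeLine(String line);
        void end();
    }

    /** Accepts every non-empty line and records the lifecycle calls it saw. */
    static final class CountingProcessor implements LineProcessor {
        boolean begun;
        boolean ended;
        int accepted;

        public void begin() {
            begun = true;
        }

        public int writeLine(String line) {
            if (line.isEmpty()) {
                return 0; // nothing sent to output
            }
            accepted++;
            return 1; // assumed value for "line was sent to output"
        }

        public void end() {
            ended = true;
        }
    }

    /** Drives a processor over the lines and sums the writeLine results. */
    static int drive(LineProcessor processor, List<String> lines) {
        int written = 0;
        processor.begin();
        for (String line : lines) {
            written += processor.writeLine(line);
        }
        processor.end();
        return written;
    }

    public static void main(String[] args) {
        CountingProcessor processor = new CountingProcessor();
        int written = drive(processor, Arrays.asList("line one", "", "line two"));
        System.out.println("written=" + written + " ended=" + processor.ended);
    }
}
```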
public CaptureSearchResult getClosest()
protected CaptureSearchResult getLastAdded()
public CaptureSearchResults getSearchResults()
Overrides: getSearchResults in class CDXToSearchResultWriter
public SelfRedirectFilter getSelfRedirFilter()
public void setSelfRedirFilter(SelfRedirectFilter selfRedirFilter)
@Deprecated public ExclusionFilter getExclusionFilter()
public void setExclusionFilter(ExclusionFilter exclusionFilter)
Deprecated. 2014-11-10 Use new implementation AccessPoint.setExclusionFactory(org.archive.wayback.accesscontrol.ExclusionFilterFactory)
If not null, the filter will be applied before revisit resolution.
Note: there is no class using this property in baseline Wayback. You need to write a custom class to utilize this property. See CDXServer and LocalResourceIndex for other ways of configuring exclusion filters.
This method is deprecated because it can run exclusion after timestamp deduplication, which results in undesirable capture search results. Exclusion should happen in the regular CDXServer pipeline. This method was necessary to implement a collection-sensitive exclusion filter. The new exclusion filter factory addresses such needs in the ordinary CDX filtering pipeline.
public boolean isIncludeBlockedCaptures()
public void setIncludeBlockedCaptures(boolean includeBlockedCaptures)
Set to true if blocked captures are to be included in the result.
This is a tentative property, specifically intended for looking up the revisit original for URL-agnostic revisits. It may change in the future.
Parameters:
includeBlockedCaptures -
Copyright © 2005–2015 IIPC. All rights reserved.