public class EmbeddedCDXServerIndex extends AbstractRequestHandler implements MementoHandler, ResourceIndex
Modifier and Type | Field and Description |
---|---|
protected String | baseStatusFilter |
protected String | baseStatusRegexp |
protected UrlCanonicalizer | canonicalizer |
protected CDXServer | cdxServer |
protected org.archive.format.cdx.CDXInputSource | extraSource |
protected List<String> | ignoreRobotPaths |
protected int | limit |
protected String | preferContains |
protected String | remoteCdxPath |
static String | REQUEST_REVISIT_LOOKUP: WaybackRequest parameter name for telling EmbeddedCDXServerIndex that it's looking up a specific single capture needed for replaying URL-agnostic revisit. |
protected SelfRedirectFilter | selfRedirFilter |
protected int | timestampDedupLength |
protected boolean | tryFuzzyMatch |
Constructor and Description |
---|
EmbeddedCDXServerIndex() |
Modifier and Type | Method and Description |
---|---|
void | addTimegateHeaders(javax.servlet.http.HttpServletResponse response, CaptureSearchResults results, WaybackRequest wbRequest, boolean includeOriginal) |
protected static String | buildStatusFilter(String regexp) |
protected AuthToken | createAuthToken(WaybackRequest wbRequest, String urlkey): robots.txt may be ignored for embedded resources (CSS, images, javascripts); robots.txt may be ignored if urlkey starts with any of ignoreRobotPaths. |
protected CDXQuery | createQuery(WaybackRequest wbRequest, boolean isFuzzy) |
protected org.archive.util.iterator.CloseableIterator<String> | createRemoteIter(String urlkey, org.archive.util.binsearch.impl.HTTPSeekableLineReader reader) |
SearchResults | doQuery(WaybackRequest wbRequest) |
String | getBaseStatusRegexp() |
UrlCanonicalizer | getCanonicalizer() |
protected CDXToCaptureSearchResultsWriter | getCaptureSearchWriter(WaybackRequest wbRequest, AuthToken waybackAuthToken, boolean isFuzzy): create CDXWriter for writing capture search result. |
CDXServer | getCdxServer() |
org.archive.format.cdx.CDXInputSource | getExtraSource() |
List<String> | getIgnoreRobotPaths() |
int | getLimit() |
String | getPreferContains() |
String | getRemoteAuthCookie() |
String | getRemoteAuthCookieIgnoreRobots() |
org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory | getRemoteCdxHttp() |
String | getRemoteCdxPath() |
SelfRedirectFilter | getSelfRedirFilter() |
int | getTimestampDedupLength() |
protected CDXToSearchResultWriter | getUrlSearchWriter(WaybackRequest wbRequest) |
boolean | handleRequest(javax.servlet.http.HttpServletRequest httpRequest, javax.servlet.http.HttpServletResponse httpResponse): Possibly handle an incoming HttpServletRequest, much like a normal HttpServlet, but includes a return value. |
boolean | isTryFuzzyMatch() |
protected void | loadWaybackCdx(String urlkey, WaybackRequest wbRequest, CDXQuery query, AuthToken waybackAuthToken, CDXToSearchResultWriter resultWriter, boolean fuzzy) |
SearchResults | query(WaybackRequest wbRequest): Transform a WaybackRequest into a ResourceResults. |
protected void | remoteCdxServerQuery(String urlkey, CDXQuery query, AuthToken authToken, CDXToSearchResultWriter resultWriter) |
boolean | renderMementoTimemap(WaybackRequest wbRequest, javax.servlet.http.HttpServletRequest request, javax.servlet.http.HttpServletResponse response) |
void | setBaseStatusRegexp(String baseStatusRegexp): filter on statuscode field applied by default for interactive CDX lookup (i.e. from Wayback UI, not via CDX Server API). |
void | setCanonicalizer(UrlCanonicalizer canonicalizer) |
void | setCdxServer(CDXServer cdxServer) |
void | setExtraSource(org.archive.format.cdx.CDXInputSource extraSource) |
void | setIgnoreRobotPaths(List<String> ignoreRobotPaths) |
void | setLimit(int limit) |
void | setPreferContains(String preferContains): substring of filename field identifying preferred archive among multiple copies of the same capture. |
void | setRemoteAuthCookie(String remoteAuthCookie) |
void | setRemoteAuthCookieIgnoreRobots(String remoteAuthCookieIgnoreRobots) |
void | setRemoteCdxHttp(org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory remoteCdxHttp) |
void | setRemoteCdxPath(String remoteCdxPath) |
void | setSelfRedirFilter(SelfRedirectFilter selfRedirFilter) |
void | setTimestampDedupLength(int timestampDedupLength): The number of digits of timestamp used for culling (deduplicating) captures in CDX query result. |
void | setTryFuzzyMatch(boolean tryFuzzyMatch) |
void | shutdown(): Release any resources used by this ResourceIndex cleanly |
getAccessPointPath, getBeanName, getInternalPort, getMapParam, getMapParamOrEmpty, getRequiredMapParam, getServletContext, registerPortListener, setAccessPointPath, setBeanName, setInternalPort, setServletContext, translateRequestPath, translateRequestPathQuery
protected CDXServer cdxServer
protected int timestampDedupLength
protected int limit
protected UrlCanonicalizer canonicalizer
protected SelfRedirectFilter selfRedirFilter
protected String remoteCdxPath
protected org.archive.format.cdx.CDXInputSource extraSource
protected String preferContains
protected boolean tryFuzzyMatch
protected String baseStatusRegexp
protected String baseStatusFilter
public static final String REQUEST_REVISIT_LOOKUP
WaybackRequest parameter name for telling EmbeddedCDXServerIndex that it's looking up a specific single capture needed for replaying a URL-agnostic revisit.
Defined here, without setter/getter, because this is an experimental parameter supporting the soft-block feature. It's very likely to change.
public SearchResults query(WaybackRequest wbRequest) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, AccessControlException
Transform a WaybackRequest into a ResourceResults.
Specified by: query in interface ResourceIndex
Parameters:
wbRequest - WaybackRequest object from RequestParser
Throws:
ResourceIndexNotAvailableException - if the ResourceIndex is not available (remote host down, local files missing, etc.)
ResourceNotInArchiveException - if the ResourceIndex could be contacted, but no SearchResult objects matched the request
BadQueryException - if the WaybackRequest is lacking information required to make a reasonable search of this ResourceIndex
AccessControlException - if SearchResult objects actually matched, but could not be returned due to AccessControl restrictions (robots.txt documents, Administrative URL blocks, etc.)
protected AuthToken createAuthToken(WaybackRequest wbRequest, String urlkey)
robots.txt may be ignored for embedded resources (CSS, images, javascripts).
robots.txt may be ignored if urlkey starts with any of ignoreRobotPaths.
Parameters:
wbRequest -
urlkey -
Returns:
AuthToken representing user's privileges on urlkey.
public SearchResults doQuery(WaybackRequest wbRequest) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, AccessControlException
protected void loadWaybackCdx(String urlkey, WaybackRequest wbRequest, CDXQuery query, AuthToken waybackAuthToken, CDXToSearchResultWriter resultWriter, boolean fuzzy) throws IOException, AccessControlException
Throws:
IOException
AccessControlException
protected CDXQuery createQuery(WaybackRequest wbRequest, boolean isFuzzy)
Creates the CDXQuery that is sent to CDXServer.
The query specifies standard CDX server params described at:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
Note: this method adds extra filters meant for interactive (Wayback UI) use. The CDXServer web API should not use this method. This method is used for replay and capture-search requests only.
TODO: move this to CDXQuery as a static method.
Parameters:
wbRequest - WaybackRequest, either replay or capture-query
isFuzzy - unused (?)
Returns:
CDXQuery object for the supplied request.
protected void remoteCdxServerQuery(String urlkey, CDXQuery query, AuthToken authToken, CDXToSearchResultWriter resultWriter) throws IOException, AccessControlException
Throws:
IOException
AccessControlException
protected org.archive.util.iterator.CloseableIterator<String> createRemoteIter(String urlkey, org.archive.util.binsearch.impl.HTTPSeekableLineReader reader) throws IOException
Throws:
IOException
protected CDXToCaptureSearchResultsWriter getCaptureSearchWriter(WaybackRequest wbRequest, AuthToken waybackAuthToken, boolean isFuzzy)
Create CDXWriter for writing capture search result.
Possible future changes:
waybackAuthToken
Parameters:
wbRequest - WaybackRequest for configuring CDXQuery
waybackAuthToken - unused
isFuzzy - true to enable fuzzy query
protected CDXToSearchResultWriter getUrlSearchWriter(WaybackRequest wbRequest)
public boolean renderMementoTimemap(WaybackRequest wbRequest, javax.servlet.http.HttpServletRequest request, javax.servlet.http.HttpServletResponse response) throws WaybackException, IOException
Specified by: renderMementoTimemap in interface MementoHandler
Throws:
WaybackException
IOException
public boolean handleRequest(javax.servlet.http.HttpServletRequest httpRequest, javax.servlet.http.HttpServletResponse httpResponse) throws javax.servlet.ServletException, IOException
Possibly handle an incoming HttpServletRequest, much like a normal HttpServlet, but includes a return value.
Specified by: handleRequest in interface RequestHandler
Parameters:
httpRequest - the incoming HttpServletRequest
httpResponse - the HttpServletResponse to return data to the client.
Throws:
javax.servlet.ServletException - for usual reasons.
IOException - for usual reasons.
public void shutdown() throws IOException
Release any resources used by this ResourceIndex cleanly.
Specified by: shutdown in interface ResourceIndex
Throws:
IOException - for usual causes
public CDXServer getCdxServer()
public void setCdxServer(CDXServer cdxServer)
public int getTimestampDedupLength()
public void setTimestampDedupLength(int timestampDedupLength)
The number of digits of timestamp used for culling (deduplicating) captures in CDX query result.
For example, with this property set to 11, query(WaybackRequest) will return at most one capture within each 10-minute span.
A non-positive value or 14 disables deduplication.
Note: deduplication is done by CDXServer. ZipNumIndex also implements timestamp deduplication, which can be turned on by setting defaultParams.timestampDedupLength to a positive value.
It is recommended to leave this off and use this parameter only, for several reasons:
Note that it is now possible to pass a collapseTime parameter to EmbeddedCDXServerIndex#query, and this timestampDedupLength parameter serves as a default, used only when collapseTime is unspecified. See WaybackRequest.setCollapseTime(int).
Parameters:
timestampDedupLength - the number of digits of timestamp used for deduplication.
See Also:
ZipNumParams.setTimestampDedupLength(int),
CDXServer.writeCdxResponse(org.archive.cdxserver.writer.CDXWriter, org.archive.util.iterator.CloseableIterator<java.lang.String>, int, org.archive.cdxserver.CDXQuery, org.archive.cdxserver.auth.AuthToken, org.archive.cdxserver.filter.CDXAccessFilter),
WaybackRequest.setCollapseTime(int)
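The effect of the digit count can be illustrated with a minimal, self-contained sketch. It is not the CDXServer implementation: the class and method names are hypothetical, timestamps are assumed to be 14-digit strings (YYYYMMDDhhmmss) already sorted in ascending order, and keeping the first capture of each span is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; CDXServer performs the real deduplication.
final class TimestampDedupSketch {

    // Keeps one timestamp per distinct prefix of dedupLength digits.
    // Assumes sortedTimestamps is in ascending order.
    static List<String> dedup(List<String> sortedTimestamps, int dedupLength) {
        // A non-positive value or 14 (the full timestamp) disables deduplication.
        if (dedupLength <= 0 || dedupLength >= 14) {
            return new ArrayList<>(sortedTimestamps);
        }
        List<String> kept = new ArrayList<>();
        String lastPrefix = null;
        for (String ts : sortedTimestamps) {
            String prefix = ts.substring(0, Math.min(dedupLength, ts.length()));
            if (!prefix.equals(lastPrefix)) {
                kept.add(ts); // assumption: the first capture in each span is kept
                lastPrefix = prefix;
            }
        }
        return kept;
    }
}
```

With a length of 11 the prefix is YYYYMMDDhhm, so captures at 12:00:00 and 12:05:00 share a prefix and collapse to one, while a capture at 12:10:00 starts a new 10-minute span.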
public SelfRedirectFilter getSelfRedirFilter()
public void setSelfRedirFilter(SelfRedirectFilter selfRedirFilter)
public UrlCanonicalizer getCanonicalizer()
public void setCanonicalizer(UrlCanonicalizer canonicalizer)
public int getLimit()
public void setLimit(int limit)
public void addTimegateHeaders(javax.servlet.http.HttpServletResponse response, CaptureSearchResults results, WaybackRequest wbRequest, boolean includeOriginal)
Specified by: addTimegateHeaders in interface MementoHandler
public String getRemoteCdxPath()
public void setRemoteCdxPath(String remoteCdxPath)
public String getRemoteAuthCookie()
public void setRemoteAuthCookie(String remoteAuthCookie)
public String getRemoteAuthCookieIgnoreRobots()
public void setRemoteAuthCookieIgnoreRobots(String remoteAuthCookieIgnoreRobots)
public org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory getRemoteCdxHttp()
public void setRemoteCdxHttp(org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory remoteCdxHttp)
public org.archive.format.cdx.CDXInputSource getExtraSource()
public void setExtraSource(org.archive.format.cdx.CDXInputSource extraSource)
public String getPreferContains()
public void setPreferContains(String preferContains)
Substring of filename field identifying preferred archive among multiple copies of the same capture.
Parameters:
preferContains -
See Also:
CDXToCaptureSearchResultsWriter
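A rough sketch of the selection this property describes, under stated assumptions: the class name, method name, and file names are hypothetical, and falling back to the first copy when nothing matches is an assumption rather than documented behavior of CDXToCaptureSearchResultsWriter.

```java
import java.util.List;

// Hypothetical illustration of preferring one copy of a capture by filename substring.
final class PreferContainsSketch {

    // Returns the first filename containing preferContains; otherwise the first filename.
    // Assumes filenames is non-empty and lists copies of the same capture.
    static String pick(List<String> filenames, String preferContains) {
        if (preferContains != null && !preferContains.isEmpty()) {
            for (String filename : filenames) {
                if (filename.contains(preferContains)) {
                    return filename;
                }
            }
        }
        return filenames.get(0); // assumption: no preference means keep the first copy
    }
}
```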
public boolean isTryFuzzyMatch()
public void setTryFuzzyMatch(boolean tryFuzzyMatch)
public String getBaseStatusRegexp()
public void setBaseStatusRegexp(String baseStatusRegexp)
Filter on statuscode field applied by default for interactive CDX lookup (i.e. from Wayback UI, not via CDX Server API).
Value is a regular expression for the status code field. Only those CDXes with a matching statuscode field will be returned. Leading and trailing spaces are stripped off.
If the value starts with "!", only CDXes with a non-matching statuscode field will be returned (exception: value "!" is treated as an empty string, i.e. no filtering).
The value is ignored if WaybackRequest.isBestLatestReplayRequest is set, in which case the hard-coded value "[23].." is used.
Default value is "!(500|502|504)".
NOTE: this is a quick hack to allow for customizing replay/listing of 5xx captures. It may be replaced by a different customization method, or moved to another class in the future.
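The value rules above (trimming, "!" negation, and "!" alone meaning no filtering) can be sketched as a standalone predicate. This is not buildStatusFilter and does not reproduce its output format. The class and method names are hypothetical, and treating the expression as a full match against the status code is an assumption.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical illustration of the baseStatusRegexp value rules.
final class StatusFilterSketch {

    // Builds a predicate that returns true for status codes that should be kept.
    static Predicate<String> fromRegexp(String regexp) {
        String value = regexp == null ? "" : regexp.trim(); // leading/trailing spaces stripped
        boolean negate = value.startsWith("!");
        if (negate) {
            value = value.substring(1);
        }
        if (value.isEmpty()) {
            return status -> true; // "" or "!" means no filtering
        }
        Pattern pattern = Pattern.compile(value);
        // Assumption: the whole statuscode field must match the expression.
        Predicate<String> matches = status -> pattern.matcher(status).matches();
        return negate ? matches.negate() : matches;
    }
}
```

Under these assumptions, the default "!(500|502|504)" keeps a 200 capture and drops a 502 capture, while "[23].." keeps only 2xx and 3xx captures.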
Parameters:
baseStatusRegexp - regular expression for status code.
Copyright © 2005–2015 IIPC. All rights reserved.