public class EmbeddedCDXServerIndex extends AbstractRequestHandler implements MementoHandler, ResourceIndex
| Modifier and Type | Field and Description | 
|---|---|
| protected String | baseStatusFilter |
| protected String | baseStatusRegexp |
| protected UrlCanonicalizer | canonicalizer |
| protected CDXServer | cdxServer |
| protected org.archive.format.cdx.CDXInputSource | extraSource |
| protected List<String> | ignoreRobotPaths |
| protected int | limit |
| protected String | preferContains |
| protected String | remoteCdxPath |
| static String | REQUEST_REVISIT_LOOKUP: WaybackRequest parameter name for telling EmbeddedCDXServerIndex that it's looking up a specific single capture needed for replaying URL-agnostic revisit. |
| protected SelfRedirectFilter | selfRedirFilter |
| protected int | timestampDedupLength |
| protected boolean | tryFuzzyMatch |
| Constructor and Description | 
|---|
| EmbeddedCDXServerIndex() |
| Modifier and Type | Method and Description | 
|---|---|
| void | addTimegateHeaders(javax.servlet.http.HttpServletResponse response, CaptureSearchResults results, WaybackRequest wbRequest, boolean includeOriginal) |
| protected static String | buildStatusFilter(String regexp) |
| protected AuthToken | createAuthToken(WaybackRequest wbRequest, String urlkey): robots.txt may be ignored for embedded resources (CSS, images, JavaScript); robots.txt may be ignored if urlkey starts with any of ignoreRobotPaths. |
| protected CDXQuery | createQuery(WaybackRequest wbRequest, boolean isFuzzy) |
| protected org.archive.util.iterator.CloseableIterator<String> | createRemoteIter(String urlkey, org.archive.util.binsearch.impl.HTTPSeekableLineReader reader) |
| SearchResults | doQuery(WaybackRequest wbRequest) |
| String | getBaseStatusRegexp() |
| UrlCanonicalizer | getCanonicalizer() |
| protected CDXToCaptureSearchResultsWriter | getCaptureSearchWriter(WaybackRequest wbRequest, AuthToken waybackAuthToken, boolean isFuzzy): Create CDXWriter for writing capture search result. |
| CDXServer | getCdxServer() |
| org.archive.format.cdx.CDXInputSource | getExtraSource() |
| List<String> | getIgnoreRobotPaths() |
| int | getLimit() |
| String | getPreferContains() |
| String | getRemoteAuthCookie() |
| String | getRemoteAuthCookieIgnoreRobots() |
| org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory | getRemoteCdxHttp() |
| String | getRemoteCdxPath() |
| SelfRedirectFilter | getSelfRedirFilter() |
| int | getTimestampDedupLength() |
| protected CDXToSearchResultWriter | getUrlSearchWriter(WaybackRequest wbRequest) |
| boolean | handleRequest(javax.servlet.http.HttpServletRequest httpRequest, javax.servlet.http.HttpServletResponse httpResponse): Possibly handle an incoming HttpServletRequest, much like a normal HttpServlet, but includes a return value. |
| boolean | isTryFuzzyMatch() |
| protected void | loadWaybackCdx(String urlkey, WaybackRequest wbRequest, CDXQuery query, AuthToken waybackAuthToken, CDXToSearchResultWriter resultWriter, boolean fuzzy) |
| SearchResults | query(WaybackRequest wbRequest): Transform a WaybackRequest into a ResourceResults. |
| protected void | remoteCdxServerQuery(String urlkey, CDXQuery query, AuthToken authToken, CDXToSearchResultWriter resultWriter) |
| boolean | renderMementoTimemap(WaybackRequest wbRequest, javax.servlet.http.HttpServletRequest request, javax.servlet.http.HttpServletResponse response) |
| void | setBaseStatusRegexp(String baseStatusRegexp): Filter on statuscode field applied by default for interactive CDX lookup (i.e. from Wayback UI, not via CDX Server API). |
| void | setCanonicalizer(UrlCanonicalizer canonicalizer) |
| void | setCdxServer(CDXServer cdxServer) |
| void | setExtraSource(org.archive.format.cdx.CDXInputSource extraSource) |
| void | setIgnoreRobotPaths(List<String> ignoreRobotPaths) |
| void | setLimit(int limit) |
| void | setPreferContains(String preferContains): Substring of filename field identifying preferred archive among multiple copies of the same capture. |
| void | setRemoteAuthCookie(String remoteAuthCookie) |
| void | setRemoteAuthCookieIgnoreRobots(String remoteAuthCookieIgnoreRobots) |
| void | setRemoteCdxHttp(org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory remoteCdxHttp) |
| void | setRemoteCdxPath(String remoteCdxPath) |
| void | setSelfRedirFilter(SelfRedirectFilter selfRedirFilter) |
| void | setTimestampDedupLength(int timestampDedupLength): The number of digits of timestamp used for culling (deduplicating) captures in CDX query result. |
| void | setTryFuzzyMatch(boolean tryFuzzyMatch) |
| void | shutdown(): Release any resources used by this ResourceIndex cleanly. |
Methods inherited from class AbstractRequestHandler: getAccessPointPath, getBeanName, getInternalPort, getMapParam, getMapParamOrEmpty, getRequiredMapParam, getServletContext, registerPortListener, setAccessPointPath, setBeanName, setInternalPort, setServletContext, translateRequestPath, translateRequestPathQuery
protected CDXServer cdxServer
protected int timestampDedupLength
protected int limit
protected UrlCanonicalizer canonicalizer
protected SelfRedirectFilter selfRedirFilter
protected String remoteCdxPath
protected org.archive.format.cdx.CDXInputSource extraSource
protected String preferContains
protected boolean tryFuzzyMatch
protected String baseStatusRegexp
protected String baseStatusFilter
public static final String REQUEST_REVISIT_LOOKUP
WaybackRequest parameter name for telling
 EmbeddedCDXServerIndex that it's looking up a specific single
 capture needed for replaying URL-agnostic revisit.
 Defined here, without setter/getter, because this is an experimental parameter supporting the soft-block feature. It is very likely to change.
public SearchResults query(WaybackRequest wbRequest) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, AccessControlException
Description copied from interface: ResourceIndex
Transform a WaybackRequest into a ResourceResults.
Specified by: query in interface ResourceIndex
Parameters:
wbRequest - WaybackRequest object from RequestParser
Throws:
ResourceIndexNotAvailableException - if the ResourceIndex is not available (remote host down, local files missing, etc.)
ResourceNotInArchiveException - if the ResourceIndex could be contacted, but no SearchResult objects matched the request
BadQueryException - if the WaybackRequest is lacking information required to make a reasonable search of this ResourceIndex
AccessControlException - if SearchResult objects actually matched, but could not be returned due to AccessControl restrictions (robots.txt documents, Administrative URL blocks, etc.)
protected AuthToken createAuthToken(WaybackRequest wbRequest, String urlkey)
robots.txt may be ignored for embedded resources (CSS, images, JavaScript).
robots.txt may be ignored if urlkey starts with any of ignoreRobotPaths.
Parameters:
wbRequest -
urlkey -
Returns:
AuthToken representing user's privileges on urlkey.
public SearchResults doQuery(WaybackRequest wbRequest) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, AccessControlException
protected void loadWaybackCdx(String urlkey, WaybackRequest wbRequest, CDXQuery query, AuthToken waybackAuthToken, CDXToSearchResultWriter resultWriter, boolean fuzzy) throws IOException, AccessControlException
Throws: IOException, AccessControlException
protected CDXQuery createQuery(WaybackRequest wbRequest, boolean isFuzzy)
Creates the CDXQuery that is sent to CDXServer.
 The query specifies the standard CDX server parameters described at:
 https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
 Note: this method adds extra filters meant for interactive (Wayback UI)
 use. The CDXServer web API should not use this method; it is used
 for replay and capture-search requests only.
 TODO: move this to CDXQuery as static method.
Parameters:
wbRequest - WaybackRequest, either replay or capture-query
isFuzzy - unused (?)
Returns:
CDXQuery object for the supplied request.
protected void remoteCdxServerQuery(String urlkey, CDXQuery query, AuthToken authToken, CDXToSearchResultWriter resultWriter) throws IOException, AccessControlException
Throws: IOException, AccessControlException
protected org.archive.util.iterator.CloseableIterator<String> createRemoteIter(String urlkey, org.archive.util.binsearch.impl.HTTPSeekableLineReader reader) throws IOException
Throws: IOException
protected CDXToCaptureSearchResultsWriter getCaptureSearchWriter(WaybackRequest wbRequest, AuthToken waybackAuthToken, boolean isFuzzy)
Create CDXWriter for writing capture search result.
 Possible future changes:
 waybackAuthToken
Parameters:
wbRequest - WaybackRequest for configuring CDXQuery
waybackAuthToken - unused
isFuzzy - true to enable fuzzy query
protected CDXToSearchResultWriter getUrlSearchWriter(WaybackRequest wbRequest)
public boolean renderMementoTimemap(WaybackRequest wbRequest, javax.servlet.http.HttpServletRequest request, javax.servlet.http.HttpServletResponse response) throws WaybackException, IOException
Specified by: renderMementoTimemap in interface MementoHandler
Throws: WaybackException, IOException
public boolean handleRequest(javax.servlet.http.HttpServletRequest httpRequest,
                    javax.servlet.http.HttpServletResponse httpResponse)
                      throws javax.servlet.ServletException,
                             IOException
Description copied from interface: RequestHandler
Possibly handle an incoming HttpServletRequest, much like a normal HttpServlet, but includes a return value.
Specified by: handleRequest in interface RequestHandler
Parameters:
httpRequest - the incoming HttpServletRequest
httpResponse - the HttpServletResponse to return data to the client.
Throws:
javax.servlet.ServletException - for usual reasons.
IOException - for usual reasons.
public void shutdown()
              throws IOException
Description copied from interface: ResourceIndex
Release any resources used by this ResourceIndex cleanly.
Specified by: shutdown in interface ResourceIndex
Throws:
IOException - for usual causes
public CDXServer getCdxServer()
public void setCdxServer(CDXServer cdxServer)
public int getTimestampDedupLength()
public void setTimestampDedupLength(int timestampDedupLength)
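The number of digits of timestamp used for culling (deduplicating) captures in CDX query result.
For example, with this property set to 11, query(WaybackRequest) will return at most one capture within each 10-minute span.
A non-positive value, or 14, disables deduplication.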
Note: deduplication is done by CDXServer.
 ZipNumIndex also implements timestamp-deduplication, which
 can be turned on by setting positive value to defaultParams.timestampDedupLength.
 It is recommended to leave this off and use this parameter only,
 for several reasons:
 
Note that it is now possible to pass a collapseTime parameter to
 EmbeddedCDXServerIndex#query; this timestampDedupLength
 parameter then serves as a default, used only when collapseTime
 is unspecified.
 See WaybackRequest.setCollapseTime(int).
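The effect of the digit count can be illustrated with a small standalone sketch. It is not the CDXServer implementation: the class and method names are hypothetical, and keeping the first capture of each group is an assumption made for illustration. It only shows why truncating a 14-digit timestamp (yyyyMMddHHmmss) to 11 digits groups captures into 10-minute spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real deduplication is done by CDXServer.
final class TimestampDedupSketch {
    /** Length of a full CDX timestamp, yyyyMMddHHmmss. */
    static final int FULL_TIMESTAMP_LENGTH = 14;

    /**
     * Keeps one capture per distinct timestamp prefix of the given length.
     * The input must already be sorted by timestamp.
     * Assumption: the first capture of each group is the one kept.
     */
    static List<String> dedup(List<String> sortedTimestamps, int dedupLength) {
        // A non-positive value, or 14, disables deduplication.
        // Treating values above 14 the same way is an assumption of this sketch.
        if (dedupLength <= 0 || dedupLength >= FULL_TIMESTAMP_LENGTH) {
            return new ArrayList<>(sortedTimestamps);
        }
        List<String> kept = new ArrayList<>();
        String lastPrefix = null;
        for (String ts : sortedTimestamps) {
            String prefix = ts.substring(0, Math.min(dedupLength, ts.length()));
            if (!prefix.equals(lastPrefix)) {
                kept.add(ts);
                lastPrefix = prefix;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> captures = List.of(
            "20150101120000", "20150101120530", "20150101120959",
            "20150101121000", "20150101121500");
        // 11 digits = yyyyMMddHHm, i.e. one group per 10 minutes.
        System.out.println(dedup(captures, 11)); // [20150101120000, 20150101121000]
        // 14 digits compares full timestamps, so nothing is dropped.
        System.out.println(dedup(captures, 14).size()); // 5
    }
}
```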
Parameters:
timestampDedupLength - the number of digits of timestamp used for deduplication.
See Also:
ZipNumParams.setTimestampDedupLength(int)
CDXServer.writeCdxResponse(org.archive.cdxserver.writer.CDXWriter, org.archive.util.iterator.CloseableIterator<java.lang.String>, int, org.archive.cdxserver.CDXQuery, org.archive.cdxserver.auth.AuthToken, org.archive.cdxserver.filter.CDXAccessFilter)
WaybackRequest.setCollapseTime(int)
public SelfRedirectFilter getSelfRedirFilter()
public void setSelfRedirFilter(SelfRedirectFilter selfRedirFilter)
public UrlCanonicalizer getCanonicalizer()
public void setCanonicalizer(UrlCanonicalizer canonicalizer)
public int getLimit()
public void setLimit(int limit)
public void addTimegateHeaders(javax.servlet.http.HttpServletResponse response,
                      CaptureSearchResults results,
                      WaybackRequest wbRequest,
                      boolean includeOriginal)
Specified by: addTimegateHeaders in interface MementoHandler
public String getRemoteCdxPath()
public void setRemoteCdxPath(String remoteCdxPath)
public String getRemoteAuthCookie()
public void setRemoteAuthCookie(String remoteAuthCookie)
public String getRemoteAuthCookieIgnoreRobots()
public void setRemoteAuthCookieIgnoreRobots(String remoteAuthCookieIgnoreRobots)
public org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory getRemoteCdxHttp()
public void setRemoteCdxHttp(org.archive.util.binsearch.impl.HTTPSeekableLineReaderFactory remoteCdxHttp)
public org.archive.format.cdx.CDXInputSource getExtraSource()
public void setExtraSource(org.archive.format.cdx.CDXInputSource extraSource)
public String getPreferContains()
public void setPreferContains(String preferContains)
Substring of filename field identifying preferred
 archive among multiple copies of the same capture.
Parameters:
preferContains -
See Also:
CDXToCaptureSearchResultsWriter
public boolean isTryFuzzyMatch()
public void setTryFuzzyMatch(boolean tryFuzzyMatch)
public String getBaseStatusRegexp()
public void setBaseStatusRegexp(String baseStatusRegexp)
Filter on statuscode field applied by default for interactive
 CDX lookup (i.e. from Wayback UI, not via CDX Server API).
 Value is a regular expression for the status code field. Only those CDXes
 with a matching statuscode field will be returned. Leading and trailing spaces are stripped off.
 If the value starts with "!",
 only CDXes with a non-matching statuscode field will be returned
 (exception: value "!" is treated as an empty string, i.e. no filtering).
Value will be ignored if WaybackRequest.isBestLatestReplayRequest is set, for which
 the hard-coded value "[23].." is used.
Default value is "!(500|502|504)".
NOTE: this is a quick hack to allow for customizing replay/listing of 5xx captures. It may be replaced by a different customization method, or moved to another class in the future.
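The rules above (trimming, "!" negation, and "!" alone meaning no filtering) can be sketched as a standalone predicate. This is an illustration under assumptions, not the code behind buildStatusFilter: the class and method names are hypothetical, and requiring the regular expression to match the whole statuscode field is assumed rather than confirmed here.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical illustration of the baseStatusRegexp semantics described above.
final class StatusFilterSketch {
    /** Builds a predicate over the statuscode field from a baseStatusRegexp-style value. */
    static Predicate<String> statusFilter(String regexp) {
        // Leading and trailing spaces are stripped off.
        String value = regexp == null ? "" : regexp.trim();
        boolean negate = value.startsWith("!");
        if (negate) {
            value = value.substring(1);
        }
        // An empty value, including a bare "!", means no filtering.
        if (value.isEmpty()) {
            return status -> true;
        }
        Pattern pattern = Pattern.compile(value);
        // Assumption: the expression must match the entire statuscode field.
        return status -> pattern.matcher(status).matches() != negate;
    }

    public static void main(String[] args) {
        Predicate<String> byDefault = statusFilter("!(500|502|504)");
        System.out.println(byDefault.test("200")); // true
        System.out.println(byDefault.test("502")); // false

        Predicate<String> bestLatest = statusFilter("[23]..");
        System.out.println(bestLatest.test("302")); // true
        System.out.println(bestLatest.test("404")); // false
    }
}
```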
Parameters:
baseStatusRegexp - regular expression for status code.
Copyright © 2005–2015 IIPC. All rights reserved.