public class RobotExclusionFilter extends ExclusionFilter
Modifier and Type | Field and Description |
---|---|
protected HashMap<String,Integer> | pathsCache |
protected static String | ROBOT_SUFFIX |
protected StringBuilder | sb |
protected static Pattern | WWWN_PATTERN |
protected static String | WWWN_REGEX |
Fields inherited from ExclusionFilter: filterGroup
Fields inherited from ObjectFilter: FILTER_ABORT, FILTER_EXCLUDE, FILTER_INCLUDE
Constructor and Description |
---|
RobotExclusionFilter(LiveWebCache webCache, String userAgent, long maxCacheMS) Construct a new RobotExclusionFilter that uses webCache to pull robots.txt documents. Filtering is based on userAgent, and cached documents in the webCache that are no older than maxCacheMS milliseconds are considered valid. |
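The freshness rule in the constructor description can be pictured with a small check. This is only a sketch: the class name `CacheFreshnessSketch`, the method `isFresh`, and the inclusive boundary are assumptions made for illustration, and none of them are part of the documented API or of LiveWebCache.

```java
// Hypothetical helper, not part of the documented API.
final class CacheFreshnessSketch {

    /**
     * Returns true when a document fetched at fetchedAtMs is at most
     * maxCacheMS milliseconds old at time nowMs. Treating the boundary
     * as inclusive is an assumption of this sketch.
     */
    static boolean isFresh(long fetchedAtMs, long nowMs, long maxCacheMS) {
        return (nowMs - fetchedAtMs) <= maxCacheMS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneDayMs = 24L * 60 * 60 * 1000;

        // Fetched one second ago with a one-day limit: still usable.
        System.out.println(isFresh(now - 1000, now, oneDayMs));

        // Fetched two days ago with a one-day limit: needs a new fetch.
        System.out.println(isFresh(now - 2 * oneDayMs, now, oneDayMs));
    }
}
```

A larger maxCacheMS reduces requests to the live web at the cost of honoring robots.txt changes more slowly.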
Modifier and Type | Method and Description |
---|---|
int | filterObject(CaptureSearchResult r) Inspect the record and determine whether it should be included in the results or excluded, or whether processing of new records should stop. |
LiveWebCache | getWebCache() |
protected String | hostToRobotUrlString(String host, String scheme) |
protected List<String> | searchResultToRobotUrlStrings(String resultHost, String scheme) |
Methods inherited from ExclusionFilter: setFilterGroup
protected static final String ROBOT_SUFFIX
protected static String WWWN_REGEX
protected static final Pattern WWWN_PATTERN
protected StringBuilder sb
public RobotExclusionFilter(LiveWebCache webCache, String userAgent, long maxCacheMS)
webCache - LiveWebCache from which documents can be retrieved
userAgent - String user agent to use for requests to the live web.
maxCacheMS - long number of milliseconds to cache documents in the LiveWebCache
protected List<String> searchResultToRobotUrlStrings(String resultHost, String scheme)
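The signatures of hostToRobotUrlString and searchResultToRobotUrlStrings, together with the WWWN_PATTERN and ROBOT_SUFFIX fields, suggest that one result host can map to several candidate robots.txt URLs. The following is a sketch of how such a mapping might look, not the class's actual implementation. This page does not show the values of ROBOT_SUFFIX or WWWN_REGEX, so the values used here are assumptions, as are the class name `RobotUrlSketch`, the method name `candidates`, and the choice to pass the scheme without a trailing `://`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; field values and candidate order are assumptions.
final class RobotUrlSketch {
    static final String ROBOT_SUFFIX = "/robots.txt";          // assumed value
    static final Pattern WWWN_PATTERN =
            Pattern.compile("^www[0-9]+\\.");                  // assumed regex

    // Build a robots.txt URL for one host, e.g. http://example.com/robots.txt
    static String hostToRobotUrlString(String host, String scheme) {
        return scheme + "://" + host + ROBOT_SUFFIX;
    }

    // List the robots.txt URLs worth checking for a given result host.
    static List<String> candidates(String host, String scheme) {
        List<String> urls = new ArrayList<>();
        urls.add(hostToRobotUrlString(host, scheme));
        Matcher m = WWWN_PATTERN.matcher(host);
        if (host.startsWith("www.")) {
            // www.example.com: also try the bare host
            urls.add(hostToRobotUrlString(host.substring(4), scheme));
        } else if (m.find()) {
            // www2.example.com: also try www.example.com and example.com
            String bare = host.substring(m.end());
            urls.add(hostToRobotUrlString("www." + bare, scheme));
            urls.add(hostToRobotUrlString(bare, scheme));
        } else {
            // example.com: also try the www. variant
            urls.add(hostToRobotUrlString("www." + host, scheme));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : candidates("www2.example.com", "http")) {
            System.out.println(url);
        }
    }
}
```

Checking several host variants covers sites that publish robots.txt under only one of their equivalent hostnames.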
public int filterObject(CaptureSearchResult r)
Specified by: filterObject in interface ObjectFilter
r - Object which should be checked for inclusion/exclusion or abort
public LiveWebCache getWebCache()
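The int returned by filterObject tells the caller to include the record, exclude it, or stop processing. A caller might consume those three outcomes roughly as sketched below. The class `FilterLoopSketch`, the method `apply`, and the numeric constant values are placeholders for illustration only; the real constants are defined by ObjectFilter, and their values are not shown on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a caller; not part of the documented API.
final class FilterLoopSketch {
    // Placeholder values; the real ones come from ObjectFilter.
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;
    static final int FILTER_ABORT = 2;

    // Keep included records, skip excluded ones, stop at the first abort.
    static <T> List<T> apply(List<T> records, ToIntFunction<T> filter) {
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            int verdict = filter.applyAsInt(record);
            if (verdict == FILTER_ABORT) {
                break;                 // no further records are examined
            }
            if (verdict == FILTER_INCLUDE) {
                kept.add(record);
            }
            // FILTER_EXCLUDE: drop this record and continue
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/a", "/private/b", "/c");
        // Toy filter standing in for a robots.txt check.
        List<String> kept = apply(paths,
                p -> p.startsWith("/private/") ? FILTER_EXCLUDE : FILTER_INCLUDE);
        System.out.println(kept);
    }
}
```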
Copyright © 2005–2017 IIPC. All rights reserved.