public class RobotRulesTest
extends junit.framework.TestCase
RobotRules
References:
Note: the GYBA convention document and the RFC differ in terminology. What GYBA calls a "group" is called a "record" in the RFC.
| Constructor and Description |
|---|
| RobotRulesTest() |
| Modifier and Type | Method and Description |
|---|---|
| protected RobotRules | load(String txt) |
| void | setUp() |
| void | testBlankLineInGroup(): This is a syntax error per the RFC, but OK under the Google/Yahoo/Bing/Ask convention. |
| void | testComments() |
| void | testDirectivesAreCaseInsensitive(): Basic test. |
| void | testEmptyDisallowHasNoEffect(): Disallow: with an empty path has no effect. |
| void | testEndOfPath(): Google/Bing/Yahoo/Ask extension: $ matches the end of path. |
| void | testEOLs(): LF, CRLF, and CR are recognized as end-of-line. |
| void | testExtraSpace2(): White space is allowed at the beginning of the line, too. |
| void | testLessSpaceExtraSpace(): Optional white space before/after ":" and before EOL. |
| void | testMostSpecificPathPrevails(): By GYBA convention, if multiple disallow and allow directives match the URL, the most specific rule, based on the length of the path, wins over less specific (shorter) ones. |
| void | testMultiUA(): Multiple User-agent: lines for a record. |
| void | testMultiUAWithOtherLinesLine(): There is a sitemap (non-allow/disallow) directive after User-agent. |
| void | testMultiUAWithOtherLinesLine2(): Similarly to the previous test case, a Crawl-delay: line shall end the group. |
| void | testNonBlocksPathForUA(): Multiple records for different User-agents. |
| void | testPercentEncodedPath(): Characters may be %-escaped, but %2f (/) is special. |
| void | testSubpath(): Path matching basics: substring-based, / is not special, and case-sensitive. |
| void | testUserAgentIsCaseInsensitive(): User-agent name comparisons are case-insensitive. |
| void | testWildcardMatch(): Wildcard in path matches any characters, including /. |
countTestCases, createResult, getName, run, run, runBare, runTest, setName, tearDown, toStringassertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertFalse, assertFalse, assertNotNull, assertNotNull, assertNotSame, assertNotSame, assertNull, assertNull, assertSame, assertSame, assertTrue, assertTrue, fail, failpublic static final String WB_UA
public void setUp()
Overrides: setUp in class junit.framework.TestCase

protected RobotRules load(String txt)
throws IOException
public void testDirectivesAreCaseInsensitive()
throws Exception
Basic test.
public void testEmptyDisallowHasNoEffect()
throws Exception
Disallow: with an empty path has no effect.
public void testLessSpaceExtraSpace()
throws Exception
Optional white space before/after ":" and before EOL.

public void testExtraSpace2()
throws Exception
White space is allowed at the beginning of the line, too.
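The white-space and case rules exercised by testLessSpaceExtraSpace(), testExtraSpace2(), and testDirectivesAreCaseInsensitive() can be illustrated with a small line splitter. This is a minimal sketch, not the RobotRules parser: DirectiveLine and its parse method are hypothetical names introduced here, and returning null for a line without ":" is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper (not part of RobotRules): splits one robots.txt line. */
final class DirectiveLine {
    final String name;   // directive name, lower-cased so comparisons ignore case
    final String value;  // directive value with surrounding white space removed

    private DirectiveLine(String name, String value) {
        this.name = name;
        this.value = value;
    }

    /** Returns null when the line has no ':' separator (assumption of this sketch). */
    static DirectiveLine parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            return null;
        }
        // trim() drops white space at the start of the line, on either side
        // of ':', and before the end of the line.
        String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        String value = line.substring(colon + 1).trim();
        return new DirectiveLine(name, value);
    }
}
```

With this sketch, "  DisAllow :  /private  " and "disallow:/private" both yield the name "disallow" and the value "/private".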
public void testEOLs()
throws Exception
LF, CRLF, and CR are recognized as end-of-line.
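As an illustration of the end-of-line rule checked by testEOLs(), the following sketch splits text on LF, CRLF, or CR. RobotsEol and splitLines are hypothetical names; how RobotRules itself reads lines is not shown in this page.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of RobotRules): splits robots.txt text into lines. */
final class RobotsEol {
    /**
     * Splits on CRLF, a lone CR, or a lone LF. CRLF is listed first in the
     * alternation so that it counts as one line break rather than two.
     * The -1 limit keeps empty strings, so blank lines stay visible to the caller.
     */
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\r\n|\r|\n", -1));
    }
}
```

Keeping blank lines in the result leaves the decision about their meaning (record separator per the RFC, or ignorable per GYBA) to the caller.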
public void testUserAgentIsCaseInsensitive()
throws Exception
User-agent name comparisons are case-insensitive.
public void testNonBlocksPathForUA()
throws Exception
Multiple records for different User-agents.
While the RFC states "the format logically consists of a non-empty set or records, separated by blank lines", Google's documentation makes no mention of blank lines as group separators. Instead, it recognizes a sequence of User-agent: lines as the start of a "group". So this sample is a syntax error per the RFC, but okay according to the Google/Yahoo/Bing/Ask convention.
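The grouping rule described here (a run of User-agent: lines opens a group, and blank lines play no role) can be sketched as follows. GroupSketch and Group are hypothetical names, the input is assumed to be already split into {name, value} pairs with lower-cased names, and treating any non-User-agent directive as the end of the User-agent run is an assumption drawn from the Crawl-delay test description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not the RobotRules implementation) of GYBA-style grouping. */
final class GroupSketch {
    /** One group: the User-agent names that open it and the directives that follow. */
    static final class Group {
        final List<String> userAgents = new ArrayList<>();
        final List<String[]> directives = new ArrayList<>();
    }

    /** Each element of lines is {name, value}; blank lines are simply absent. */
    static List<Group> group(List<String[]> lines) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        boolean inUserAgentRun = false;
        for (String[] line : lines) {
            if (line[0].equals("user-agent")) {
                if (!inUserAgentRun) {
                    // First User-agent after some other directive: new group.
                    current = new Group();
                    groups.add(current);
                    inUserAgentRun = true;
                }
                current.userAgents.add(line[1]);
            } else {
                // Any other directive ends the User-agent run (assumption).
                inUserAgentRun = false;
                if (current != null) {
                    current.directives.add(line);
                }
                // Directives before the first User-agent are ignored in this sketch.
            }
        }
        return groups;
    }
}
```

Because blank lines never reach the grouping step, a blank line inside a group has no effect in this sketch, which matches the GYBA reading rather than the RFC one.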
public void testMultiUAWithOtherLinesLine()
throws Exception
There is a sitemap (non-allow/disallow) directive after User-agent.
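public void testMultiUAWithOtherLinesLine2()
throws Exception
Similarly to the previous test case, a Crawl-delay: line shall end the group.

public void testBlankLineInGroup()
throws Exception
This is a syntax error per the RFC, but OK under the Google/Yahoo/Bing/Ask convention.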
public void testMultiUA()
throws Exception
Multiple User-agent: lines for a record.
public void testSubpath()
throws Exception
Path matching basics: substring-based, / is not special, and case-sensitive.

public void testPercentEncodedPath()
throws Exception
Characters may be %-escaped, but %2f (/) is special.
(TODO: additional tests: robots.txt is assumed to be UTF-8 encoded. Non-7-bit-ASCII characters are allowed, and can also be %-escaped.)

public void testMostSpecificPathPrevails()
throws Exception
By GYBA convention, if multiple disallow and allow directives match the URL, the most specific rule, based on the length of the path, wins over less specific (shorter) ones.
The RFC says differently: "a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used."
We follow the GYBA convention here.
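The difference between the two resolution strategies can be seen in a small sketch. LongestMatchSketch and Rule are hypothetical names, matching is reduced to a plain case-sensitive prefix comparison (no wildcards), rules with an empty path are skipped, and ties between equally long paths go to the earlier rule; all of these are simplifying assumptions rather than documented RobotRules behavior.

```java
import java.util.List;

/** Hypothetical sketch contrasting GYBA longest-match with RFC first-match. */
final class LongestMatchSketch {
    /** One Allow or Disallow line. */
    static final class Rule {
        final boolean allow;
        final String path;

        Rule(boolean allow, String path) {
            this.allow = allow;
            this.path = path;
        }
    }

    /** GYBA-style: among matching rules, the one with the longest path decides. */
    static boolean allowedLongest(List<Rule> rules, String urlPath) {
        Rule best = null;
        for (Rule r : rules) {
            if (r.path.isEmpty() || !urlPath.startsWith(r.path)) {
                continue; // empty paths have no effect; non-matching rules are skipped
            }
            if (best == null || r.path.length() > best.path.length()) {
                best = r;
            }
        }
        return best == null || best.allow; // no matching rule means allowed
    }

    /** RFC-style: the first matching rule in file order decides. */
    static boolean allowedFirst(List<Rule> rules, String urlPath) {
        for (Rule r : rules) {
            if (!r.path.isEmpty() && urlPath.startsWith(r.path)) {
                return r.allow;
            }
        }
        return true;
    }
}
```

For the rules "Disallow: /a" followed by "Allow: /a/b", the path /a/b/c is allowed under the longest-match strategy but disallowed under first-match, while /a/x is disallowed under both.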
public void testWildcardMatch()
throws Exception
Wildcard in path matches any characters, including /.
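Wildcard matching, together with the $ end-of-path extension listed in the method summary, can be sketched by translating a rule path into a regular expression. WildcardSketch is a hypothetical name; treating * as the wildcard character, honoring $ only as the last character of the pattern, and anchoring every pattern at the start of the path are assumptions of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not the RobotRules implementation) of '*' and '$' matching. */
final class WildcardSketch {
    static boolean matches(String pattern, String path) {
        // A trailing '$' requires the match to reach the end of the path.
        boolean anchoredAtEnd = pattern.endsWith("$");
        String body = anchoredAtEnd
                ? pattern.substring(0, pattern.length() - 1)
                : pattern;

        // Quote the literal pieces and join them with ".*" for each '*'.
        String[] parts = body.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // '*' matches any characters, including '/'
            }
            regex.append(Pattern.quote(parts[i]));
        }

        Matcher m = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(path);
        // lookingAt() anchors at the start only; matches() anchors at both ends.
        return anchoredAtEnd ? m.matches() : m.lookingAt();
    }
}
```

In this sketch a trailing * adds nothing, since an unanchored pattern already matches any path that starts with it.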
/* is the same as /.
Copyright © 2005–2015 IIPC. All rights reserved.