public class RobotRulesTest
extends junit.framework.TestCase
Tests for RobotRules.
References:
Note: The GYBA convention document and the RFC differ in terminology. What GYBA calls a "group" is called a "record" in the RFC.
Constructor and Description |
---|
RobotRulesTest() |
Modifier and Type | Method and Description |
---|---|
protected RobotRules | load(String txt) |
void | setUp() |
void | testBlankLineInGroup(): This is a syntax error per the RFC, but OK under the Google/Yahoo/Bing/Ask convention. |
void | testComments() |
void | testDirectivesAreCaseInsensitive(): Basic test. |
void | testEmptyDisallowHasNoEffect(): Disallow: with an empty path has no effect. |
void | testEndOfPath(): Google/Bing/Yahoo/Ask extension: $ matches the end of the path. |
void | testEOLs(): LF, CRLF, and CR are recognized as end-of-line. |
void | testExtraSpace2(): White space is allowed at the beginning of the line, too. |
void | testLessSpaceExtraSpace(): Optional white space before and after ":", and before EOL. |
void | testMostSpecificPathPrevails(): By GYBA convention, if multiple disallow and allow directives match the URL, the most specific rule, based on the length of the path, wins over less specific (shorter) ones. |
void | testMultiUA(): Multiple User-agent: lines for a record. |
void | testMultiUAWithOtherLinesLine(): There is a Sitemap (non-allow/disallow) directive after User-agent. |
void | testMultiUAWithOtherLinesLine2(): Similarly to the previous test case, a Crawl-delay: line shall end the group. |
void | testNonBlocksPathForUA(): Multiple records for different User-agents. |
void | testPercentEncodedPath(): Characters may be %-escaped, but %2f (/) is special. |
void | testSubpath(): Path matching basics: substring-based, / is not special, and case-sensitive. |
void | testUserAgentIsCaseInsensitive(): User-agent name comparisons are case-insensitive. |
void | testWildcardMatch(): A wildcard in the path matches any characters, including /. |
Methods inherited from class junit.framework.TestCase: countTestCases, createResult, getName, run, runBare, runTest, setName, tearDown, toString
Methods inherited from class junit.framework.Assert: assertEquals, assertFalse, assertNotNull, assertNotSame, assertNull, assertSame, assertTrue, fail
public static final String WB_UA
public void setUp()
Overrides: setUp in class junit.framework.TestCase
protected RobotRules load(String txt) throws IOException
IOException
public void testDirectivesAreCaseInsensitive() throws Exception
Exception
public void testEmptyDisallowHasNoEffect() throws Exception
Exception
public void testLessSpaceExtraSpace() throws Exception
Optional white space before and after ":", and before EOL.
Exception
public void testExtraSpace2() throws Exception
Exception
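The whitespace and case rules exercised by testDirectivesAreCaseInsensitive, testLessSpaceExtraSpace and testExtraSpace2 can be illustrated with a small line parser. This is a minimal sketch, not the RobotRules implementation: the class DirectiveLineSketch and its parse method are hypothetical names, and splitting at the first ":" is an assumption.

```java
import java.util.Locale;

/** Hypothetical sketch of tolerant directive parsing; not part of RobotRules. */
class DirectiveLineSketch {
    /** Returns {name, value} with the name lower-cased, or null if the line holds no directive. */
    static String[] parse(String line) {
        int colon = line.indexOf(':');          // assumption: the first ':' separates name and value
        if (colon < 0) {
            return null;
        }
        // trim() removes white space at the start of the line and around ':'
        String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        // trim() removes white space after ':' and before the end of line
        String value = line.substring(colon + 1).trim();
        if (name.isEmpty()) {
            return null;
        }
        return new String[] { name, value };
    }
}
```

With this sketch, "  DISALLOW :  /private  " and "Disallow:/private" both parse to the pair ("disallow", "/private").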
public void testEOLs() throws Exception
Exception
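One way to recognize LF, CRLF and CR as line ends, as testEOLs requires, is a single regular-expression split. The sketch below is illustrative only; EolSplitSketch and splitLines are hypothetical names, and the real parser may read lines differently.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of end-of-line handling; not part of RobotRules. */
class EolSplitSketch {
    static List<String> splitLines(String txt) {
        // "\r\n" is listed first so that CRLF counts as one line break, not two
        return Arrays.asList(txt.split("\r\n|\r|\n", -1));
    }
}
```

The limit of -1 keeps trailing empty strings, so a file that ends with a line break yields a final empty line that a caller can skip.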
public void testUserAgentIsCaseInsensitive() throws Exception
Exception
public void testNonBlocksPathForUA() throws Exception
While the RFC states "the format logically consists of a non-empty set or records, separated by blank lines", Google's documentation makes no mention of blank lines as a group separator. Instead, it recognizes a sequence of User-agent: lines as the start of a "group". So this sample is a syntax error per the RFC, but okay according to the Google/Yahoo/Bing/Ask convention.
Exception
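The grouping behavior described above, where a run of User-agent: lines opens a group and blank lines do not separate groups, might be sketched as follows. GroupingSketch and its members are hypothetical names. Treating any non-User-agent directive as the end of the User-agent run is an assumption based on the Sitemap and Crawl-delay test descriptions, not confirmed RobotRules behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of GYBA-style grouping; not part of RobotRules. */
class GroupingSketch {
    static final class Group {
        final List<String> agents = new ArrayList<>();
        final List<String> rules = new ArrayList<>();   // stored as "allow:/path" or "disallow:/path"
    }

    static List<Group> parse(String txt) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        boolean inAgentRun = false;   // true while consecutive User-agent lines are being read
        for (String line : txt.split("\r\n|\r|\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;             // blank or malformed line: does not end the group here
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (name.equals("user-agent")) {
                if (!inAgentRun) {    // the first User-agent of a run opens a new group
                    current = new Group();
                    groups.add(current);
                    inAgentRun = true;
                }
                current.agents.add(value);
            } else {
                inAgentRun = false;   // assumption: any other directive ends the User-agent run
                if (current != null && (name.equals("allow") || name.equals("disallow"))) {
                    current.rules.add(name + ":" + value);
                }
            }
        }
        return groups;
    }
}
```

Under these assumptions, two User-agent: lines followed by rules form one group, and a User-agent: line that appears after a rule or a Crawl-delay: line starts the next group.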
public void testMultiUAWithOtherLinesLine() throws Exception
Exception
public void testMultiUAWithOtherLinesLine2() throws Exception
Similarly to the previous test case, a Crawl-delay: line shall end the group.
Exception
public void testBlankLineInGroup() throws Exception
Exception
public void testMultiUA() throws Exception
Exception
public void testSubpath() throws Exception
Path matching basics: substring-based, / is not special, and case-sensitive.
Exception
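A minimal sketch of the basic matching rule, assuming "substring-based" means a plain string-prefix comparison in which "/" gets no special treatment. SubpathSketch and pathMatches are hypothetical names, not part of RobotRules.

```java
/** Hypothetical sketch of basic path matching; not part of RobotRules. */
class SubpathSketch {
    /** True if rulePath is a leading part of urlPath, compared character by character. */
    static boolean pathMatches(String rulePath, String urlPath) {
        // String.startsWith is case-sensitive and does not treat '/' as a segment boundary
        return urlPath.startsWith(rulePath);
    }
}
```

Under this assumption the rule path /fish matches both /fishing and /fish/salmon.html, but not /Fish.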
public void testPercentEncodedPath() throws Exception
Characters may be %-escaped, but %2f (/) is special.
(TODO: additional tests: robots.txt is assumed to be UTF-8 encoded. Non-7-bit-ASCII characters are allowed, and can also be %-escaped.)
Exception
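One possible reading of "%2f is special" is that an escaped slash must stay escaped when paths are normalized for comparison, so that /a%2fb and /a/b remain distinct. The sketch below follows that assumption. PercentEscapeSketch and normalize are hypothetical names, and leaving non-ASCII, control and "%" escapes untouched is a simplification rather than documented RobotRules behavior.

```java
import java.util.Locale;

/** Hypothetical sketch of %-escape normalization; not part of RobotRules. */
class PercentEscapeSketch {
    static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            // parse the two hex digits after '%', if both are present
            int hi = (c == '%' && i + 2 < path.length()) ? Character.digit(path.charAt(i + 1), 16) : -1;
            int lo = hi >= 0 ? Character.digit(path.charAt(i + 2), 16) : -1;
            if (hi >= 0 && lo >= 0) {
                int code = hi * 16 + lo;
                if (code == '/' || code == '%' || code < 0x20 || code >= 0x7f) {
                    // keep the escape (assumption), only upper-casing its hex digits
                    out.append('%').append(path.substring(i + 1, i + 3).toUpperCase(Locale.ROOT));
                } else {
                    out.append((char) code);   // printable ASCII: decode
                }
                i += 3;
            } else {
                out.append(c);                 // ordinary character or malformed escape
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, /a%2Db normalizes to /a-b, while /a%2fb becomes /a%2Fb and therefore does not compare equal to /a/b.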
public void testMostSpecificPathPrevails() throws Exception
By GYBA convention, if multiple disallow and allow directives match the URL, the most specific rule, based on the length of the path, wins over less specific (shorter) ones.
The RFC says differently: "a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used."
We follow the GYBA convention here.
Exception
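The difference between the two rules can be seen in a short sketch that evaluates the same record both ways. LongestMatchSketch, Rule and both methods are hypothetical names used only for illustration. Plain prefix matching and "no matching rule means allowed" are simplifying assumptions, and tie-breaking between rules of equal length is left unspecified.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch comparing the two precedence rules; not part of RobotRules. */
class LongestMatchSketch {
    /** One Allow or Disallow line: its path and whether it grants access. */
    static final class Rule {
        final String path;
        final boolean allow;
        Rule(String path, boolean allow) { this.path = path; this.allow = allow; }
    }

    /** GYBA style: among all matching rules, the one with the longest path wins. */
    static boolean allowedByLongestMatch(List<Rule> rules, String urlPath) {
        Rule best = null;
        for (Rule r : rules) {
            if (urlPath.startsWith(r.path) && (best == null || r.path.length() > best.path.length())) {
                best = r;
            }
        }
        return best == null || best.allow;   // assumption: no matching rule means allowed
    }

    /** RFC style: the first matching rule, in the order of the record, wins. */
    static boolean allowedByFirstMatch(List<Rule> rules, String urlPath) {
        for (Rule r : rules) {
            if (urlPath.startsWith(r.path)) {
                return r.allow;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("/shop", false));         // Disallow: /shop
        rules.add(new Rule("/shop/public", true));   // Allow: /shop/public
        System.out.println(allowedByLongestMatch(rules, "/shop/public/a.html")); // prints true
        System.out.println(allowedByFirstMatch(rules, "/shop/public/a.html"));   // prints false
    }
}
```

For this record the longer Allow path wins under the GYBA convention, while first-match evaluation stops at the earlier Disallow line.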
public void testWildcardMatch() throws Exception
A wildcard in the path matches any characters, including /. /* is the same as /.
Exception
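The wildcard rule here and the $ rule from testEndOfPath can be illustrated by translating a path pattern into a regular expression. This is a sketch under assumptions: WildcardPathSketch, compile and matches are hypothetical names, "*" is taken as the wildcard character, and "$" is treated as special only at the end of the pattern.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of wildcard and end-of-path matching; not part of RobotRules. */
class WildcardPathSketch {
    /** Translates a robots.txt path pattern into a regex that must match the whole URL path. */
    static Pattern compile(String pathPattern) {
        boolean anchoredAtEnd = pathPattern.endsWith("$");
        String body = anchoredAtEnd
                ? pathPattern.substring(0, pathPattern.length() - 1)
                : pathPattern;
        StringBuilder regex = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (c == '*') {
                regex.append(".*");                               // any characters, including '/'
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));   // literal character
            }
        }
        if (!anchoredAtEnd) {
            regex.append(".*");                                   // without '$' the pattern is a prefix
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String pathPattern, String urlPath) {
        return compile(pathPattern).matcher(urlPath).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/a*.html", "/a/b/c.html")); // prints true
        System.out.println(matches("/page$", "/page/x"));       // prints false
    }
}
```

Because a pattern without $ is a prefix, /* and / accept exactly the same paths in this sketch.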
Copyright © 2005–2015 IIPC. All rights reserved.