robots.txt & Sitemap Test
This test checks crawler politeness: does the scraper respect Disallow rules, discover URLs via the sitemap, and honor Crawl-delay?
Test Passing Criteria
How it works
A well-behaved crawler should fetch robots.txt before crawling, respect Disallow directives, use the sitemap to discover URLs, and wait the specified Crawl-delay between requests.
robots.txt
Crawl rules and directives
/tests/robots/robots.txt
User-agent: *
Disallow: /tests/robots/private/
Disallow: /tests/robots/admin/
Crawl-delay: 2
Sitemap: /tests/robots/sitemap.xml

User-agent: BadBot
Disallow: /
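These rules can be checked directly with Python's standard `urllib.robotparser`. The sketch below feeds the rules above into the parser without any network fetch; the user-agent name `MyCrawler` is an illustrative assumption, not something the test prescribes:

```python
from urllib.robotparser import RobotFileParser

# The rules from the file above, supplied as lines (no HTTP fetch needed).
RULES = """\
User-agent: *
Disallow: /tests/robots/private/
Disallow: /tests/robots/admin/
Crawl-delay: 2
Sitemap: /tests/robots/sitemap.xml

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("MyCrawler", "/tests/robots/page/1"))       # True
print(rp.can_fetch("MyCrawler", "/tests/robots/private/x"))    # False
print(rp.can_fetch("BadBot", "/tests/robots/page/1"))          # False
print(rp.crawl_delay("MyCrawler"))                             # 2
print(rp.site_maps())  # ['/tests/robots/sitemap.xml'] (Python 3.8+)
```

Note that `BadBot` matches its own `User-agent` group, so it is blocked everywhere, while every other agent falls under the `*` group.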
sitemap.xml
URL discovery feed
/tests/robots/sitemap.xml
Contains 50 URLs: /tests/robots/page/1 through /tests/robots/page/50
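URL discovery from the sitemap can be sketched with the standard `xml.etree.ElementTree`. The two-entry document below is a stand-in for the real 50-URL file, and it assumes the standard sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET

# A two-entry stand-in for /tests/robots/sitemap.xml (the real file
# lists pages 1 through 50 in the same <urlset>/<url>/<loc> shape).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>/tests/robots/page/1</loc></url>
  <url><loc>/tests/robots/page/2</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['/tests/robots/page/1', '/tests/robots/page/2']
```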
Allowed Pages (1–40)
These pages return valid content and should be crawled.
Dead Pages (41–50)
These pages are listed in the sitemap but return 404. A good crawler should handle missing pages gracefully.
Page 41 (404)
Page 42 (404)
Page 43 (404)
Page 44 (404)
Page 45 (404)
Page 46 (404)
Page 47 (404)
Page 48 (404)
Page 49 (404)
Page 50 (404)
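One way to handle these dead entries gracefully is a small status-dispatch step: log and skip 404s instead of retrying them or aborting the crawl. The `handle_status` helper and its policy here are illustrative assumptions, not part of the test:

```python
def handle_status(url: str, status: int) -> str:
    """Decide what to do with a URL based on its HTTP status code.

    Hypothetical policy: crawl 200s, back off on rate-limit/outage
    codes, and skip 404s (the dead sitemap entries, pages 41-50).
    """
    if status == 200:
        return "crawl"
    if status in (429, 503):
        return "retry"   # temporary condition: wait, then try again
    return "skip"        # 404 and anything unexpected: record and move on

print(handle_status("/tests/robots/page/1", 200))   # crawl
print(handle_status("/tests/robots/page/41", 404))  # skip
```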
Disallowed Paths
These paths are blocked via robots.txt. A well-behaved crawler must NOT visit them.
/tests/robots/private/
/tests/robots/admin/
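Putting the pieces together, the loop below is a minimal sketch of a polite crawl: filter every URL through the parsed rules, then sleep the Crawl-delay between requests. The `polite_crawl` helper and the stub `fetch` callable are assumptions for illustration, not part of this test page:

```python
import time
from urllib.robotparser import RobotFileParser

def polite_crawl(rp, urls, fetch, user_agent="MyCrawler"):
    """Fetch only robots-allowed URLs, sleeping Crawl-delay between hits.

    `rp` is a pre-loaded RobotFileParser; `fetch` is any callable that
    retrieves one URL (injected here so the loop stays testable).
    """
    delay = rp.crawl_delay(user_agent) or 0
    results = {}
    for url in urls:
        if not rp.can_fetch(user_agent, url):
            continue  # blocked by a Disallow rule; never request it
        results[url] = fetch(url)
        time.sleep(delay)  # honor the Crawl-delay between requests
    return results

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tests/robots/private/",
    "Disallow: /tests/robots/admin/",
    "Crawl-delay: 2",
])

fetched = polite_crawl(
    rp,
    ["/tests/robots/page/1", "/tests/robots/private/secret"],
    fetch=lambda url: "stub response",  # stand-in for a real HTTP GET
)
print(sorted(fetched))  # only the allowed page remains
```

Injecting `fetch` as a parameter keeps the politeness logic separate from the transport, so the same loop works with any HTTP client.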