robots.txt & Sitemap Test
This test checks crawler politeness: does the scraper respect Disallow rules, discover URLs via the sitemap, and honor Crawl-delay?
Test Passing Criteria
How it works
A well-behaved crawler should fetch robots.txt before crawling, respect Disallow directives, use the sitemap to discover URLs, and wait the specified Crawl-delay between requests.
robots.txt
Crawl rules and directives
/tests/robots/robots.txt
User-agent: *
Disallow: /tests/robots/private/
Disallow: /tests/robots/admin/
Crawl-delay: 2
Sitemap: /tests/robots/sitemap.xml

User-agent: BadBot
Disallow: /
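These rules can be checked directly with Python's standard `urllib.robotparser`. The sketch below feeds the rules above into the parser without any network fetch; the user-agent name `MyCrawler` is an illustrative assumption, not something the test prescribes:

```python
from urllib.robotparser import RobotFileParser

# The rules from the file above, supplied as lines (no HTTP fetch needed).
RULES = """\
User-agent: *
Disallow: /tests/robots/private/
Disallow: /tests/robots/admin/
Crawl-delay: 2
Sitemap: /tests/robots/sitemap.xml

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("MyCrawler", "/tests/robots/page/1"))       # True
print(rp.can_fetch("MyCrawler", "/tests/robots/private/x"))    # False
print(rp.can_fetch("BadBot", "/tests/robots/page/1"))          # False
print(rp.crawl_delay("MyCrawler"))                             # 2
print(rp.site_maps())  # ['/tests/robots/sitemap.xml'] (Python 3.8+)
```

Note that `BadBot` matches its own `User-agent` group, so it is blocked everywhere, while every other agent falls under the `*` group.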
sitemap.xml
URL discovery feed
/tests/robots/sitemap.xml
Contains 50 URLs: /tests/robots/page/1 through /tests/robots/page/50
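URL discovery from the sitemap can be sketched with the standard `xml.etree.ElementTree`. The two-entry document below is a stand-in for the real 50-URL file, and it assumes the standard sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET

# A two-entry stand-in for /tests/robots/sitemap.xml (the real file
# lists pages 1 through 50 in the same <urlset>/<url>/<loc> shape).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>/tests/robots/page/1</loc></url>
  <url><loc>/tests/robots/page/2</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['/tests/robots/page/1', '/tests/robots/page/2']
```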
Allowed Pages (1–40)
These pages return valid content and should be crawled.
Dead Pages (41–50)
These pages are listed in the sitemap but return 404. A good crawler should handle missing pages gracefully.
Page 41 (404)
Page 42 (404)
Page 43 (404)
Page 44 (404)
Page 45 (404)
Page 46 (404)
Page 47 (404)
Page 48 (404)
Page 49 (404)
Page 50 (404)
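One way to handle these dead entries gracefully is a small status-dispatch step: log and skip 404s instead of retrying them or aborting the crawl. The `handle_status` helper and its policy here are illustrative assumptions, not part of the test:

```python
def handle_status(url: str, status: int) -> str:
    """Decide what to do with a URL based on its HTTP status code.

    Hypothetical policy: crawl 200s, back off on rate-limit/outage
    codes, and skip 404s (the dead sitemap entries, pages 41-50).
    """
    if status == 200:
        return "crawl"
    if status in (429, 503):
        return "retry"   # temporary condition: wait, then try again
    return "skip"        # 404 and anything unexpected: record and move on

print(handle_status("/tests/robots/page/1", 200))   # crawl
print(handle_status("/tests/robots/page/41", 404))  # skip
```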
Disallowed Paths
These paths are blocked via robots.txt. A well-behaved crawler must NOT visit them.
/tests/robots/private/
/tests/robots/admin/
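Putting the pieces together, the loop below is a minimal sketch of a polite crawl: filter every URL through the parsed rules, then sleep the Crawl-delay between requests. The `polite_crawl` helper and the stub `fetch` callable are assumptions for illustration, not part of this test page:

```python
import time
from urllib.robotparser import RobotFileParser

def polite_crawl(rp, urls, fetch, user_agent="MyCrawler"):
    """Fetch only robots-allowed URLs, sleeping Crawl-delay between hits.

    `rp` is a pre-loaded RobotFileParser; `fetch` is any callable that
    retrieves one URL (injected here so the loop stays testable).
    """
    delay = rp.crawl_delay(user_agent) or 0
    results = {}
    for url in urls:
        if not rp.can_fetch(user_agent, url):
            continue  # blocked by a Disallow rule; never request it
        results[url] = fetch(url)
        time.sleep(delay)  # honor the Crawl-delay between requests
    return results

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tests/robots/private/",
    "Disallow: /tests/robots/admin/",
    "Crawl-delay: 2",
])

fetched = polite_crawl(
    rp,
    ["/tests/robots/page/1", "/tests/robots/private/secret"],
    fetch=lambda url: "stub response",  # stand-in for a real HTTP GET
)
print(sorted(fetched))  # only the allowed page remains
```

Injecting `fetch` as a parameter keeps the politeness logic separate from the transport, so the same loop works with any HTTP client.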