robots.txt & Sitemap Test

This test checks crawler politeness: does the scraper respect Disallow rules, discover URLs via the sitemap, and honor Crawl-delay?

Test Passing Criteria
robots.txt (crawl rules and directives), served at /tests/robots/robots.txt:

    User-agent: *
    Disallow: /tests/robots/private/
    Disallow: /tests/robots/admin/
    Crawl-delay: 2
    Sitemap: /tests/robots/sitemap.xml

    User-agent: BadBot
    Disallow: /
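A crawler can evaluate these rules with Python's standard-library urllib.robotparser. The sketch below parses the rules shown above from an embedded string, so it runs without network access; the user-agent name MyCrawler is an arbitrary placeholder, not part of the test.

```python
from urllib import robotparser

# The rules above, embedded as a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tests/robots/private/
Disallow: /tests/robots/admin/
Crawl-delay: 2
Sitemap: /tests/robots/sitemap.xml

User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyCrawler", "/tests/robots/page/1"))     # allowed: no rule matches
print(rp.can_fetch("MyCrawler", "/tests/robots/private/x"))  # blocked by Disallow
print(rp.can_fetch("BadBot", "/tests/robots/page/1"))        # BadBot is banned outright
print(rp.crawl_delay("MyCrawler"))                           # seconds between requests
print(rp.site_maps())                                        # sitemap URLs declared above
```

Note that robotparser matches user-agent groups by substring, so the BadBot ban applies only to agents whose name contains "badbot"; everyone else falls through to the `*` group.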
sitemap.xml (URL discovery feed), served at /tests/robots/sitemap.xml:

Contains 50 URLs: /tests/robots/page/1 through /tests/robots/page/50
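URL discovery can be sketched with the standard library's xml.etree.ElementTree. The XML below is an assumed two-entry excerpt in the sitemaps.org 0.9 format; the real file lists all 50 page URLs, and the actual entries may be absolute URLs rather than the paths shown here.

```python
import xml.etree.ElementTree as ET

# Assumed excerpt of /tests/robots/sitemap.xml in the sitemaps.org 0.9 format;
# the real file lists /tests/robots/page/1 through /tests/robots/page/50.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>/tests/robots/page/1</loc></url>
  <url><loc>/tests/robots/page/2</loc></url>
</urlset>"""

# The sitemap namespace must be given explicitly, or findall matches nothing.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['/tests/robots/page/1', '/tests/robots/page/2']
```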

Dead Pages (41–50)
These pages are listed in the sitemap but return 404. A good crawler should handle missing pages gracefully.
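Graceful handling of the dead entries can be sketched as a small fetch helper that treats 404 as "skip this URL" rather than a crash. fetch_page is a hypothetical helper, not part of any library, and the data: URL in the demo stands in for a live endpoint so the sketch runs offline.

```python
from urllib import error, request

def fetch_page(url, timeout=10):
    """Return the response body, or None when the server reports 404.

    Sitemap entries can go stale (pages 41-50 in this test), so a
    missing page is skipped rather than treated as a fatal error.
    """
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except error.HTTPError as exc:
        if exc.code == 404:
            return None  # dead sitemap entry: skip and continue the crawl
        raise  # other HTTP errors (e.g. 5xx) may deserve a retry instead

# Demo with a data: URL so no network is needed.
print(fetch_page("data:text/plain,ok"))  # b'ok'
```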
Disallowed Paths
These paths are blocked via robots.txt. A well-behaved crawler must NOT visit them.
/tests/robots/private/
/tests/robots/admin/
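Honoring Crawl-delay: 2 means spacing requests at least two seconds apart. One common sketch is a small throttle object; the Throttle class below is illustrative, not part of any library, and the demo uses a short delay so it runs quickly.

```python
import time

class Throttle:
    """Enforce a minimum gap between successive requests (Crawl-delay)."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to keep requests `delay` seconds apart."""
        now = time.monotonic()
        if self._last is not None and now - self._last < self.delay:
            time.sleep(self.delay - (now - self._last))
        self._last = time.monotonic()

# For this test the crawler would use Throttle(2.0) and call wait()
# before every request; a 0.1 s delay keeps the demo fast.
throttle = Throttle(0.1)
start = time.monotonic()
throttle.wait()  # first request: no wait
throttle.wait()  # second request: sleeps the remaining ~0.1 s
print(time.monotonic() - start >= 0.1)  # True
```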

robots.txt & Sitemap Test — Part of the ScrapeMe test suite.