source: other-projects/maori-lang-detection/to_crawl.tar.gz@ 33904

Last change on this file since 33904 was 33904, checked in by ak19, 4 years ago

Shouldn't greylist anglican.org, as this prevented crawling of the justus.anglican.org seedURLs. There is, however, no need to add an exception to sites-too-big-to-exhaustively-crawl.txt to control how much we crawl, as we only crawl to depth 10 anyway and the seedURLs already list the most promising pages (as well as 2 URLs on anglican.org which weren't promising). Added the to_crawl and finished crawled data for this. siteID is 01463.

  • Property svn:mime-type set to application/octet-stream
File size: 1.4 MB

