Changeset 33904

Timestamp:
2020-02-05T18:48:33+13:00
Author:
ak19
Message:

Shouldn't greylist anglican.org, as this prevented crawling of the justus.anglican.org seedURLs. However, there's no need to add an exception to sites-too-big-to-exhaustively-crawl.txt to control how much we crawl, as we only crawl to depth 10 anyway and the seedURLs already list the most promising pages (as well as 2 URLs on anglican.org which weren't promising). Added the to_crawl and finished crawled data for this. siteID is 01463.

Location:
other-projects/maori-lang-detection
Files:
4 edited

Changeset view not shown, since the total size (213.5 MB) exceeds 9.5 MB
