Timestamp:
2020-02-05T18:48:33+13:00
Author:
ak19
Message:

Shouldn't greylist anglican.org, as this prevented crawling of the justus.anglican.org seedURLs. However, there's no need to add an exception to sites-too-big-to-exhaustively-crawl.txt to control how much we crawl, as we only crawl to depth 10 anyway and the seedURLs already list the most promising pages (as well as 2 URLs on anglican.org which weren't promising). Added the to_crawl and finished crawled data for this. siteID is 01463.

File:
1 edited

  • other-projects/maori-lang-detection/conf/url-greylist-filter.txt

    --- other-projects/maori-lang-detection/conf/url-greylist-filter.txt (r33569)
    +++ other-projects/maori-lang-detection/conf/url-greylist-filter.txt (r33904)
    @@ -56,3 +56,5 @@
     # single seedURL was not a page in Maori, but global languages.
    -# And the rest of the domain appears to be in English
    -anglican.org
    +# And the rest of the domain appears to be in English.
    +#anglican.org
    +# but we want the seedURLs from justus.anglican.org,
    +# so grab anglican.org anyway
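
The effect of commenting out the entry can be sketched as below. This is only an illustration and not the project's code: it assumes that lines starting with # are comments and that each active entry is matched as a substring of the URL's host. The real matching rules are not shown in this changeset. The helper names load_filter and is_greylisted and the seed URL path are hypothetical.

```python
from urllib.parse import urlparse


def load_filter(text):
    """Return the active entries, skipping blank lines and '#' comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def is_greylisted(url, entries):
    """ASSUMPTION: an entry matches when it is a substring of the host."""
    host = urlparse(url).hostname or ""
    return any(entry in host for entry in entries)


# Relevant lines of the filter file before the change (r33569).
OLD = """
# single seedURL was not a page in Maori, but global languages.
# And the rest of the domain appears to be in English
anglican.org
"""

# The same lines after the change (r33904).
NEW = """
# single seedURL was not a page in Maori, but global languages.
# And the rest of the domain appears to be in English.
#anglican.org
# but we want the seedURLs from justus.anglican.org,
# so grab anglican.org anyway
"""

# Hypothetical seed URL on the subdomain named in the commit message.
seed = "http://justus.anglican.org/"

print(is_greylisted(seed, load_filter(OLD)))
print(is_greylisted(seed, load_filter(NEW)))
```

Under that assumption, the old entry also catches the justus.anglican.org subdomain, which is the problem the commit message describes. With the entry commented out, nothing in this part of the file matches it.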