source: gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt@ 33588

Last change on this file since 33588 was 33569, checked in by ak19, 5 years ago
  1. batchcrawl.sh now does what it should have from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, since the input to_crawl folder gets replaced every time I regenerate it after black/white/greylisting more urls.
  2. Blacklisted more adult sites; greylisted more product sites and the .ru, .pl and .tk domains, with whitelisting in the whitelist file.
  3. CCWETProcessor now looks out for additional adult sites based on URL, adds them to its in-memory blacklist (not the file), and logs the domain so it can be checked and manually added to the blacklist file.
File size: 762 bytes
# URL 'whitelist': urls of these forms go into the keep pile.
# whitelist overrides blacklist and greylist.
# FORMAT:
# precede URL by ^ to whitelist urls that match the given prefix
# follow URL with $ to whitelist urls that match the given suffix
# ^url$ will whitelist urls that match the given url completely
# Without either ^ or $ symbol, urls containing the given url will get whitelisted

# Special exception for this url on yale.edu, since we needed to blacklist
# some particular other urls on yale.edu
http://korora.econ.yale.edu/phillips/archive/hauraki.htm

# We've added .ru$ sites to the blacklist, but the following
# Russian websites contain actual Maori language content
http://www.krassotkin.ru/sites/prayer.su/maori/
https://mi.centr-zashity.ru/
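
The FORMAT rules in the file's comments (the ^ prefix marker, the $ suffix marker, ^url$ for a complete match, and plain containment otherwise) can be illustrated with a short sketch. This is not the CCWETProcessor code: the function names parse_filter_lines, matches and is_whitelisted are hypothetical, and the sketch assumes plain string comparison rather than regular expressions, which may differ from what the real processor does.

```python
def parse_filter_lines(lines):
    """Return the patterns in a filter file, skipping blanks and # comments."""
    patterns = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


def matches(pattern, url):
    """Apply the ^ / $ rules from the FORMAT comments (assumed plain-string matching)."""
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    core = pattern[1:] if anchored_start else pattern
    core = core[:-1] if anchored_end else core
    if anchored_start and anchored_end:
        return url == core            # ^url$ : complete match
    if anchored_start:
        return url.startswith(core)   # ^url  : prefix match
    if anchored_end:
        return url.endswith(core)     # url$  : suffix match
    return core in url                # url   : containment


def is_whitelisted(url, patterns):
    """True if any whitelist pattern matches the url."""
    return any(matches(p, url) for p in patterns)


# Entries taken from the whitelist file above.
whitelist = parse_filter_lines([
    "# Special exception for this url on yale.edu",
    "http://korora.econ.yale.edu/phillips/archive/hauraki.htm",
    "",
    "http://www.krassotkin.ru/sites/prayer.su/maori/",
    "https://mi.centr-zashity.ru/",
])
```

Since the whitelist overrides the blacklist and greylist, a caller following this sketch would consult is_whitelisted first and only fall through to the other lists when it returns False.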