Last change on this file since 33588 was 33569, checked in by ak19, 5 years ago:
1. batchcrawl.sh now does what it should have done from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, because the input to_crawl folder gets replaced every time I regenerate it after black/white/greylisting more urls.
2. Blacklisted more adult sites; greylisted more product sites and .ru, .pl and .tk domains, with whitelisting in the whitelist file.
3. CCWETProcessor now looks out for additional adult sites based on URL, adds them to its in-memory blacklist (not the file), and logs the domain so it can be checked and manually added to the blacklist file.
File size: 762 bytes
    # URL 'whitelist': urls of these forms go into the keep pile.
    # whitelist overrides blacklist and greylist.
    # FORMAT:
    # precede URL by ^ to whitelist urls that match the given prefix
    # follow URL with $ to whitelist urls that match the given suffix
    # ^url$ will whitelist urls that match the given url completely
    # Without either ^ or $ symbol, urls containing the given url will get whitelisted

    # Special exception for this url on yale.edu, since we needed to blacklist
    # some particular other urls on yale.edu
    http://korora.econ.yale.edu/phillips/archive/hauraki.htm

    # We've added .ru$ sites to the blacklist, but the following
    # Russian websites contain actual Maori language content
    http://www.krassotkin.ru/sites/prayer.su/maori/
    https://mi.centr-zashity.ru/
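The FORMAT comments in the file describe four match modes (prefix, suffix, exact, and contains). The following is a minimal Python sketch of how such rules could be parsed and applied. It is not CCWETProcessor's actual implementation; the function names `parse_rule`, `matches`, and `is_whitelisted` are hypothetical and introduced here only for illustration.

```python
def parse_rule(line):
    """Turn one list entry into (mode, pattern); return None for blanks and comments.

    Hypothetical helper: assumes '^' and '$' are only meaningful as the
    first and last character of an entry, as the FORMAT comments describe.
    """
    entry = line.strip()
    if not entry or entry.startswith("#"):
        return None
    starts = entry.startswith("^")
    ends = entry.endswith("$")
    if starts:
        entry = entry[1:]
    if ends:
        entry = entry[:-1]
    if not entry:
        return None  # a bare "^" or "$" carries no pattern, so skip it
    if starts and ends:
        mode = "exact"
    elif starts:
        mode = "prefix"
    elif ends:
        mode = "suffix"
    else:
        mode = "contains"
    return mode, entry


def matches(url, rule):
    """Apply a single parsed rule to a URL."""
    mode, pattern = rule
    if mode == "exact":
        return url == pattern
    if mode == "prefix":
        return url.startswith(pattern)
    if mode == "suffix":
        return url.endswith(pattern)
    return pattern in url


def is_whitelisted(url, lines):
    """Return True if any rule in the list file's lines matches the URL."""
    rules = [rule for rule in map(parse_rule, lines) if rule is not None]
    return any(matches(url, rule) for rule in rules)


# Entries taken from the whitelist file above (comments and blanks are skipped).
WHITELIST = [
    "# whitelist overrides blacklist and greylist.",
    "",
    "http://korora.econ.yale.edu/phillips/archive/hauraki.htm",
    "http://www.krassotkin.ru/sites/prayer.su/maori/",
    "https://mi.centr-zashity.ru/",
]

print(is_whitelisted("https://mi.centr-zashity.ru/about", WHITELIST))
print(is_whitelisted("https://example.ru/", WHITELIST))
```

Because none of the entries in this file carries a `^` or `$`, each one acts as a "contains" rule, so any URL that includes one of the listed strings would be kept under this sketch.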