Timestamp:
2019-10-16T20:00:09+13:00
Author:
ak19
Message:
  1. batchcrawl.sh now does what it should have from the start, which is to move the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, as the input to_crawl folder can and does get replaced all the time, every time I regenerate it after black/white/greylisting more urls.
  2. Blacklisted more adult sites, greylisted more product sites and .ru, .pl and .tk domains with whitelisting in the whitelist file.
  3. CCWETProcessor now looks out for additional adult sites based on URL and adds them to its blacklist in memory (not the file) and logs the domain for checking and manually adding to the blacklist file.
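
The behaviour described in point 3 of the message can be sketched roughly as follows. This is only an illustration in Python, not the actual CCWETProcessor code: the function name `check_and_blacklist`, the placeholder keyword list, and the logger name are all hypothetical, and the real URL heuristics are not stated in the commit message.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger("ccwet_sketch")  # hypothetical logger name

# Placeholder keywords; the real heuristics are not given in the commit message.
SUSPECT_KEYWORDS = ("keyword1", "keyword2")


def check_and_blacklist(url, blacklist, keywords=SUSPECT_KEYWORDS):
    """Return True if the URL's domain is, or has just become, blacklisted.

    `blacklist` is a set held in memory; the blacklist file is never written.
    """
    domain = urlparse(url).netloc.lower()
    if domain in blacklist:
        return True
    if any(keyword in url.lower() for keyword in keywords):
        # Add to the in-memory blacklist only, and log the domain so that
        # it can be checked and added to the blacklist file by hand.
        blacklist.add(domain)
        logger.warning("Suspect domain, check and add to blacklist file: %s", domain)
        return True
    return False
```

Keeping the addition in memory means a false positive only affects the current run; the logged domain still has to be reviewed before it reaches the blacklist file.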
File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt

--- gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (r33568)
+++ gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (r33569)
@@ -18,4 +18,6 @@
 abacre.com
 cn-huafu.net
+apteka.social
+
 
 # not product stores but autotranslated?
@@ -25,7 +27,9 @@
 1videosmusica.com
 256file.com
-7773033.ru
-abali.ru
-allbeautyone.ru
+# already in greylisting of all .ru
+#7773033.ru
+#abali.ru
+#allbeautyone.ru
+aqualuz.org
 
 # if page doesn't load and can't be tested
@@ -33,6 +37,10 @@
 www.kiterewa.pl
 
-# license plate site?
-eba.com.ru
+
+
+# MANUALLY INSPECTED URLS AND ADDED TO GREYLIST
+
+# license plate site? - already in greylisting of all .ru
+#eba.com.ru
 
 # As per archive.org, there's just a photo on the defunct page at this site
@@ -42,2 +50,16 @@
 # seems to be Indonesian or Malaysian Bible rather than in Maori or any Polynesian language
 alkitab.life:2022
+
+# appears defunct
+alixira.com
+
+# single seedURL was not a page in Maori, but global languages.
+# And the rest of the domain appears to be in English
+anglican.org
+
+
+### TLDs that we greylist - any exceptions will be in the whitelist
+# Our list of .ru and .pl domains were not relevant
+.ru/
+.pl/
+.tk/
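
As a rough illustration of how a filter file like this might be applied, the sketch below loads the entries and checks a URL against the greylist, with the whitelist taking precedence. It is written in Python for brevity and is not the project's implementation. It assumes that `#` lines and blank lines are ignored, that each entry is matched as a plain substring of the URL, and that a whitelist match overrides a greylist match; the function names and the `example.pl` whitelist entry are hypothetical.

```python
def load_filter(lines):
    """Return the active entries of a filter file: non-blank, non-comment lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def is_greylisted(url, greylist, whitelist):
    """Assumed semantics: substring match, with the whitelist checked first."""
    if any(entry in url for entry in whitelist):
        return False
    return any(entry in url for entry in greylist)


greylist = load_filter([
    "# already in greylisting of all .ru",
    "#abali.ru",
    "aqualuz.org",
    "",
    ".ru/",
    ".pl/",
    ".tk/",
])
whitelist = ["example.pl"]  # hypothetical whitelist entry

print(is_greylisted("http://abali.ru/page", greylist, whitelist))
print(is_greylisted("http://example.pl/maori", greylist, whitelist))
```

Under these assumptions the commented-out `#abali.ru` entry is no longer needed, because the `.ru/` entry already covers it. An entry such as `.ru/` only matches when the URL has a slash after the domain, so a bare `http://abali.ru` would slip through this simple substring check.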