Changeset 33568

Timestamp:
14.10.2019 23:36:54
Author:
ak19
Message:

1. Greylisted and blacklisted more sites, discovered while attempting to crawl them; I have since learnt to investigate sites before crawling. Should all .ru and .pl domains be on the greylist?
2. Adjusted the instruction comments in CCWETProcessor for compiling and running.

Location:
gs3-extensions/maori-lang-detection
Files:
4 modified

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33565 r33568  
     55   55  00.gs,SINGLEPAGE
     56   56
          57  # May be a large site
          58  topographic-map.com,SINGLEPAGE
     57   59
     58   60  # TOP SITES
  • gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt

    r33559 r33568  
     64   64  wildsexsluts.com
     65   65  xxxblacknudes.com
          66  bigsexymelons.com
          67
          68  # more adult sites
          69  acba.osb-land.com
     66   70
     67   71  # sounds like some pirating site
     68   72  ^http://pirateguides.com/
          73  fastmp3.ru
     69   74
     70   75  # from alexa topsites at https://www.alexa.com/topsites

     76   81  xhamster.com
     77   82  xnxx.com
          83
          84
          85  # not sure about the domain name and/or full url seems like it belongs here
          86  abcutie.com
  • gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt

    r33554 r33568  
      8    8
      9    9
     10       # Product sites: unwanted auto-translation pages of online product stores
          10  # Product sites: unwanted auto-translation pages of online product stores and other websites
     11   11  /product/
     12   12  /products/

     16   16  ledpar64.china-led-lighting.com
     17   17  ledwallwasher.china-led-lighting.com
          18  abacre.com
          19  cn-huafu.net
          20
          21  # not product stores but autotranslated?
          22  192-168-1-1l.com
          23  19216811login.club
          24  19216811login.club
          25  1videosmusica.com
          26  256file.com
          27  7773033.ru
          28  abali.ru
          29  allbeautyone.ru
          30
          31  # if page doesn't load and can't be tested
          32  1videosmusica.com
          33  www.kiterewa.pl
          34
          35  # license plate site?
          36  eba.com.ru
          37
          38  # As per archive.org, there's just a photo on the defunct page at this site
          39  # And the picture label and filename is probably Japanese
          40  agri.mine.utsunomiya-u.ac.jp
          41
          42  # seems to be Indonesian or Malaysian Bible rather than in Maori or any Polynesian language
          43  alkitab.life:2022
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

    r33565 r33568  
     53   53   *
     54   54   * To run, passing the log4j and other properties files in conf/ folder:
     55        *      maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing warc.wet(.gz) files> <outputFolder>
          55   *      maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing commoncrawls subfolders containing warc.wet(.gz) files> <outputFolder>
     56   56   *
     57        * e.g.
     58        *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET
     59        *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET 2>&1 | less
          57   * e.g. (from maori-lang-detection/src)
          58   *
          59   *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl
          60   *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl 2>&1 | less
     60   61   *
     61   62  */
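
Note on the filter entries: the blacklist and greylist files in this changeset contain three shapes of entry: bare domains (fastmp3.ru), path fragments (/product/), and ^-prefixed URL prefixes (^http://pirateguides.com/). The sketch below illustrates one way such entries could be matched against a URL. It is not taken from CCWETProcessor; the class name UrlFilterSketch, both method names, and the matching rules (a leading ^ means "URL starts with", anything else means "URL contains") are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the changeset: shows one plausible
// interpretation of the entries in url-blacklist-filter.txt and
// url-greylist-filter.txt.
class UrlFilterSketch {

    // Keep only real entries: skip blank lines and '#' comment lines.
    static List<String> parseEntries(List<String> lines) {
        List<String> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            entries.add(trimmed);
        }
        return entries;
    }

    // Assumed semantics: "^prefix" must match the start of the URL,
    // any other entry matches anywhere in the URL.
    static boolean matches(String url, List<String> entries) {
        for (String entry : entries) {
            if (entry.startsWith("^")) {
                if (url.startsWith(entry.substring(1))) {
                    return true;
                }
            } else if (url.contains(entry)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> entries = parseEntries(Arrays.asList(
            "# sounds like some pirating site",
            "^http://pirateguides.com/",
            "fastmp3.ru",
            "",
            "/product/"));
        System.out.println(matches("http://fastmp3.ru/track", entries));
        System.out.println(matches("https://example.org/about", entries));
    }
}
```

Under these assumed rules a bare entry such as abali.ru would also match a URL that merely mentions that string in its path or query, so a stricter variant would compare against the parsed host instead.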