Changeset 33568


Timestamp:
2019-10-14T23:36:54+13:00
Author:
ak19
Message:
  1. Greylisted and blacklisted more sites, discovered while attempting to crawl them. I have since learnt to investigate sites before crawling them. Should all .ru and .pl domains be on the greylist?
  2. Adjusted the instruction comments in CCWETProcessor for compiling and running.
Location:
gs3-extensions/maori-lang-detection
Files:
4 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33565 r33568
    55     55      00.gs,SINGLEPAGE
    56     56
           57      # May be a large site
           58      topographic-map.com,SINGLEPAGE
    57     59
    58     60      # TOP SITES
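
The entries in this file appear to follow a `domain,ACTION` pattern, with `#` comment lines and blank lines in between. The following is a minimal sketch of how such a file could be read, not the parsing code that CCWETProcessor actually uses. The class name `TooBigSitesSketch`, the method `parse`, and the empty-action fallback for lines without a comma are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of CCWETProcessor. */
final class TooBigSitesSketch {

    /**
     * Parses lines of the form "domain,ACTION", e.g. "topographic-map.com,SINGLEPAGE".
     * Blank lines and lines starting with '#' are skipped.
     * Assumption: a line without a comma is stored with an empty action.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> actions = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // comment or spacer line
            }
            int comma = line.indexOf(',');
            if (comma < 0) {
                actions.put(line, "");
            } else {
                String domain = line.substring(0, comma).trim();
                String action = line.substring(comma + 1).trim();
                actions.put(domain, action);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        // Sample lines taken from the diff above.
        List<String> sample = List.of(
            "00.gs,SINGLEPAGE",
            "",
            "# May be a large site",
            "topographic-map.com,SINGLEPAGE");
        System.out.println(parse(sample));
    }
}
```

A `LinkedHashMap` keeps the entries in file order, which makes it easier to compare the parsed result against the configuration file by eye.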
  • gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt

    r33559 r33568
    64     64      wildsexsluts.com
    65     65      xxxblacknudes.com
           66      bigsexymelons.com
           67
           68      # more adult sites
           69      acba.osb-land.com
    66     70
    67     71      # sounds like some pirating site
    68     72      ^http://pirateguides.com/
           73      fastmp3.ru
    69     74
    70     75      # from alexa topsites at https://www.alexa.com/topsites
    …      …
    76     81      xhamster.com
    77     82      xnxx.com
           83
           84
           85      # not sure about the domain name and/or full url seems like it belongs here
           86      abcutie.com
  • gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt

    r33554 r33568
    8      8
    9      9
    10             # Product sites: unwanted auto-translation pages of online product stores
           10      # Product sites: unwanted auto-translation pages of online product stores and other websites
    11     11      /product/
    12     12      /products/
    …      …
    16     16      ledpar64.china-led-lighting.com
    17     17      ledwallwasher.china-led-lighting.com
           18      abacre.com
           19      cn-huafu.net
           20
           21      # not product stores but autotranslated?
           22      192-168-1-1l.com
           23      19216811login.club
           24      19216811login.club
           25      1videosmusica.com
           26      256file.com
           27      7773033.ru
           28      abali.ru
           29      allbeautyone.ru
           30
           31      # if page doesn't load and can't be tested
           32      1videosmusica.com
           33      www.kiterewa.pl
           34
           35      # license plate site?
           36      eba.com.ru
           37
           38      # As per archive.org, there's just a photo on the defunct page at this site
           39      # And the picture label and filename is probably Japanese
           40      agri.mine.utsunomiya-u.ac.jp
           41
           42      # seems to be Indonesian or Malaysian Bible rather than in Maori or any Polynesian language
           43      alkitab.life:2022
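
The blacklist and greylist files mix three kinds of entries: plain domains (`fastmp3.ru`), path fragments (`/product/`), and entries with a leading `^` (`^http://pirateguides.com/`). The sketch below shows one plausible way to apply such a list. It assumes that `^` means "the URL starts with this text" and that every other entry means "the URL contains this text". These semantics are an assumption and may differ from what CCWETProcessor implements, and the class `UrlFilterSketch` and its method `matches` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter for illustration; not the CCWETProcessor implementation. */
final class UrlFilterSketch {
    private final List<String> prefixes = new ArrayList<>();  // from entries starting with '^'
    private final List<String> fragments = new ArrayList<>(); // all other entries

    UrlFilterSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip comments and blank lines
            }
            if (line.startsWith("^")) {
                prefixes.add(line.substring(1)); // assumed meaning: URL starts with this text
            } else {
                fragments.add(line);             // assumed meaning: URL contains this text
            }
        }
    }

    /** Returns true if the URL matches any entry in the list. */
    boolean matches(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        for (String fragment : fragments) {
            if (url.contains(fragment)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(List.of(
            "# sounds like some pirating site",
            "^http://pirateguides.com/",
            "fastmp3.ru",
            "/product/"));
        System.out.println(filter.matches("http://pirateguides.com/index.html"));
        System.out.println(filter.matches("https://example.org/about"));
    }
}
```

Plain substring matching is simple but coarse: under this assumption an entry such as `eba.com.ru` would also match a longer host name that merely ends in the same text. Comparing against the parsed host name would be stricter, at the cost of handling path entries like `/product/` separately.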
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

    r33565 r33568
    53     53       *
    54     54       * To run, passing the log4j and other properties files in conf/ folder:
    55              *      maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing warc.wet(.gz) files> <outputFolder>
           55       *      maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing commoncrawls subfolders containing warc.wet(.gz) files> <outputFolder>
    56     56       *
    57              * e.g.
    58              *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET
    59              *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET 2>&1 | less
           57       * e.g. (from maori-lang-detection/src)
           58       *
           59       *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl
           60       *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl 2>&1 | less
    60     61       *
    61     62      */
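
The revised usage comment says the first argument is now a folder of Common Crawl subfolders that in turn contain the `warc.wet(.gz)` files. A minimal sketch of collecting those files from such a layout follows. It assumes the files end in `.warc.wet` or `.warc.wet.gz` and may sit at any depth below the input folder. The class `WetFileFinderSketch` and its methods are hypothetical and are not taken from CCWETProcessor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of CCWETProcessor. */
final class WetFileFinderSketch {

    /** Assumption: WET files are recognised purely by their file name suffix. */
    static boolean isWetFile(Path path) {
        String name = path.getFileName().toString();
        return name.endsWith(".warc.wet") || name.endsWith(".warc.wet.gz");
    }

    /** Walks the input folder recursively and returns all WET files, sorted by path. */
    static List<Path> findWetFiles(Path inputFolder) {
        // Files.walk returns a lazily populated stream that must be closed.
        try (Stream<Path> paths = Files.walk(inputFolder)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(WetFileFinderSketch::isWetFile)
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The default mirrors the input folder in the example command above.
        Path inputFolder = Paths.get(args.length > 0 ? args[0] : "../ccrawl-data");
        for (Path wetFile : findWetFiles(inputFolder)) {
            System.out.println(wetFile);
        }
    }
}
```

Walking recursively rather than listing exactly one level of subfolders keeps the sketch independent of how deep the crawl folders are nested, which the changeset does not specify.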