Timestamp:
2019-10-14T23:36:54+13:00
Author:
ak19
Message:
  1. More sites greylisted and blacklisted, discovered as I attempted to crawl them and afterwards learnt to investigate sites first. Should all .ru and .pl domains be on the greylist?
  2. Adjusted instruction comments in CCWETProcessor for compiling and running
File:
1 edited

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

    r33565  r33568
    53      53         *
    54      54         * To run, passing the log4j and other properties files in conf/ folder:
    55              -  *      maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing warc.wet(.gz) files> <outputFolder>
            55      +  *      maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing commoncrawls subfolders containing warc.wet(.gz) files> <outputFolder>
    56      56         *
    57              -  * e.g.
    58              -  *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET
    59              -  *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET 2>&1 | less
            57      +  * e.g. (from maori-lang-detection/src)
            58      +  *
            59      +  *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl
            60      +  *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl 2>&1 | less
    60      61         *
    61      62        */
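
The updated usage comment says the first argument is now a folder whose subfolders contain the warc.wet(.gz) files, rather than a folder holding those files directly. A minimal sketch of what that layout implies is given here. It is not CCWETProcessor's actual code: the class name `WetFileFinder`, the method `findWetFiles`, and the file-suffix check are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of CCWETProcessor.
public class WetFileFinder {

    /**
     * Collects files ending in .warc.wet or .warc.wet.gz that sit exactly one
     * subfolder below inputFolder (inputFolder/<subfolder>/<file>), sorted by path.
     */
    public static List<Path> findWetFiles(Path inputFolder) throws IOException {
        // Walk at most two levels deep: the subfolders and the files inside them.
        try (Stream<Path> paths = Files.walk(inputFolder, 2)) {
            return paths
                .filter(Files::isRegularFile)
                // Keep only files inside a subfolder, not files directly in inputFolder.
                .filter(p -> inputFolder.relativize(p).getNameCount() == 2)
                .filter(p -> {
                    String name = p.getFileName().toString();
                    // Assumed naming convention, taken from the usage comment.
                    return name.endsWith(".warc.wet") || name.endsWith(".warc.wet.gz");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java WetFileFinder <folder containing subfolders>");
            return;
        }
        for (Path p : findWetFiles(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Run against a folder such as ../ccrawl-data from the example above, this would list the WET files one subfolder down and skip anything placed directly in the top-level folder.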