Changeset 33568
- Timestamp: 2019-10-14T23:36:54+13:00
- Location: gs3-extensions/maori-lang-detection
- Files: 4 edited
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
--- r33565
+++ r33568
@@ -55,4 +55,6 @@
 00.gs,SINGLEPAGE
 
+# May be a large site
+topographic-map.com,SINGLEPAGE
 
 # TOP SITES
gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt
--- r33559
+++ r33568
@@ -64,7 +64,12 @@
 wildsexsluts.com
 xxxblacknudes.com
+bigsexymelons.com
+
+# more adult sites
+acba.osb-land.com
 
 # sounds like some pirating site
 ^http://pirateguides.com/
+fastmp3.ru
 
 # from alexa topsites at https://www.alexa.com/topsites
@@ -76,2 +81,6 @@
 xhamster.com
 xnxx.com
+
+
+# not sure about the domain name and/or full url seems like it belongs here
+abcutie.com
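The blacklist entries above mix bare domains, `#` comment lines, blank lines, and one `^`-anchored URL. As a rough illustration of how such a file could be read and applied, here is a minimal sketch. It is not taken from CCWETProcessor.java: the class name `UrlFilterSketch`, both method names, and the matching semantics (`^` means "URL starts with", anything else means "URL contains") are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the changeset.
class UrlFilterSketch {

    /** Keeps only rule lines: blank lines and '#' comment lines are dropped. */
    static List<String> parseRules(List<String> lines) {
        List<String> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rules.add(trimmed);
        }
        return rules;
    }

    /**
     * ASSUMPTION: a rule starting with '^' matches URLs that start with the
     * rest of the rule; any other rule matches URLs that contain it.
     */
    static boolean matches(String url, List<String> rules) {
        for (String rule : rules) {
            if (rule.startsWith("^")) {
                if (url.startsWith(rule.substring(1))) {
                    return true;
                }
            } else if (url.contains(rule)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Lines copied from the blacklist hunk in this changeset.
        List<String> rules = parseRules(List.of(
                "# sounds like some pirating site",
                "^http://pirateguides.com/",
                "fastmp3.ru",
                ""));
        System.out.println(matches("http://pirateguides.com/page", rules));
        System.out.println(matches("https://fastmp3.ru/song", rules));
        System.out.println(matches("https://example.org/", rules));
    }
}
```

Under these assumed semantics, a plain substring rule is broader than an anchored one, which may be why the anchored form is used where only one specific URL prefix should be excluded.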
gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt
--- r33554
+++ r33568
@@ -8,5 +8,5 @@
 
 
-# Product sites: unwanted auto-translation pages of online product stores
+# Product sites: unwanted auto-translation pages of online product stores and other websites
 /product/
 /products/
@@ -16,2 +16,28 @@
 ledpar64.china-led-lighting.com
 ledwallwasher.china-led-lighting.com
+abacre.com
+cn-huafu.net
+
+# not product stores but autotranslated?
+192-168-1-1l.com
+19216811login.club
+19216811login.club
+1videosmusica.com
+256file.com
+7773033.ru
+abali.ru
+allbeautyone.ru
+
+# if page doesn't load and can't be tested
+1videosmusica.com
+www.kiterewa.pl
+
+# license plate site?
+eba.com.ru
+
+# As per archive.org, there's just a photo on the defunct page at this site
+# And the picture label and filename is probably Japanese
+agri.mine.utsunomiya-u.ac.jp
+
+# seems to be Indonesian or Malaysian Bible rather than in Maori or any Polynesian language
+alkitab.life:2022
gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
--- r33565
+++ r33568
@@ -53,9 +53,10 @@
  *
  * To run, passing the log4j and other properties files in conf/ folder:
- * maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing warc.wet(.gz) files> <outputFolder>
+ * maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing commoncrawls subfolders containing warc.wet(.gz) files> <outputFolder>
  *
- * e.g.
- * - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET
- * - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET 2>&1 | less
+ * e.g. (from maori-lang-detection/src)
+ *
+ * - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl
+ * - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl 2>&1 | less
  *
  */