#
# ChangeLog for gs3-extensions
#
# Generated by Trac 1.4.2
# 2024-04-19T18:45:40+12:00

Mon, 02 Dec 2019 00:54:11 GMT kjdon [33736]
	* gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm (modified)

	fixed a spelling mistake

Sun, 10 Nov 2019 20:38:55 GMT ak19 [33635]
	* other-projects/maori-lang-detection (moved)

	Maori-language-detection doesn't use Greenstone 3 at present, it's ...

Fri, 08 Nov 2019 10:59:07 GMT ak19 [33634]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MongoDBAccess.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToCSV.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/SentenceInfo.java (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextLanguageDetector.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WebpageInfo.java (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WebsiteInfo.java (added)

	Rewrote NutchTextDumpProcessor as NutchTextDumpToMongoDB.java, which ...

Fri, 08 Nov 2019 06:43:39 GMT ak19 [33633]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MongoDBAccess.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToCSV.java (moved)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextLanguageDetector.java (modified)

	1. TextLanguageDetector now has methods for collecting all sentences ...

Tue, 05 Nov 2019 08:59:46 GMT ak19 [33626]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MongoDBAccess.java (modified)

	TODOs

Tue, 05 Nov 2019 08:58:44 GMT ak19 [33625]
	* gs3-extensions/maori-lang-detection/conf/keep-since-not-product-sites.txt (added)
	* gs3-extensions/maori-lang-detection/conf/possible-product-sites.txt (added)

	A file listing domains with seedurls containing /mi(/) that are ...

Tue, 05 Nov 2019 08:48:50 GMT ak19 [33624]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	Some cleanup surrounding the now renamed function createSeedURLsFile, ...

Tue, 05 Nov 2019 08:04:09 GMT ak19 [33623]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/config.properties (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MongoDBAccess.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java (modified)

	1. Incorporated Dr Nichols' earlier suggestion of storing page ...

Tue, 05 Nov 2019 02:42:46 GMT ak19 [33622]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MongoDBAccess.java (moved)

	File rename

Mon, 04 Nov 2019 07:35:59 GMT ak19 [33621]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Committing jotted down mongodb related instructions from what Dr ...

Mon, 04 Nov 2019 01:24:25 GMT ak19 [33620]
	* gs3-extensions/maori-lang-detection/crawledNode6.tar (added)

	Final crawl, done on vagrant VM node6. Crawl site IDs 01407-01462.

Fri, 01 Nov 2019 07:14:18 GMT ak19 [33618]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)

	Adding in the download URL

Fri, 01 Nov 2019 04:13:18 GMT ak19 [33617]
	* gs3-extensions/maori-lang-detection/crawledNode5.tar (modified)

	Node5 is now full and here is the finished crawl (up to and including ...

Thu, 31 Oct 2019 07:05:07 GMT ak19 [33616]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MongoDBConnection.java (added)

	Beginnings of Java class that is to interact with MongoDB. I don't ...

Thu, 31 Oct 2019 07:03:55 GMT ak19 [33615]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/config.properties (modified)
	* gs3-extensions/maori-lang-detection/conf/log4j.properties (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified)

	1. Worked out how to configure log4j to log both to console and ...

Wed, 30 Oct 2019 10:03:19 GMT ak19 [33609]
	* gs3-extensions/maori-lang-detection/crawledNode2.tar (moved)
	* gs3-extensions/maori-lang-detection/crawledNode3.tar (moved)
	* gs3-extensions/maori-lang-detection/crawledNode4.tar (moved)
	* gs3-extensions/maori-lang-detection/crawledNode5.tar (added)

	The tar files containing the crawled sites data shouldn't be called ...

Wed, 30 Oct 2019 10:02:26 GMT ak19 [33608]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/exportHBase.sh (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)

	1. New script to export from HBase so that we could in theory ...

Tue, 29 Oct 2019 05:33:49 GMT ak19 [33607]
	* gs3-extensions/maori-lang-detection/crawledNode4.tar.gz (modified)

	Updated with the remaining successfully crawled sites on node4 before ...

Tue, 29 Oct 2019 02:18:51 GMT ak19 [33606]
	* gs3-extensions/maori-lang-detection/crawledNode2.tar.gz (moved)
	* gs3-extensions/maori-lang-detection/crawledNode3.tar.gz (added)

	1. Committing crawl data from node3 (2nd VM for nutch crawling). 2. ...

Tue, 29 Oct 2019 01:54:24 GMT ak19 [33605]
	* gs3-extensions/maori-lang-detection/crawledNode4.tar.gz (added)

	Node 4 VM still works, but committing first set of crawled sites on there

Thu, 24 Oct 2019 10:22:30 GMT ak19 [33604]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java (modified)

	1. Better output into possible-product-sites.txt including the ...

Thu, 24 Oct 2019 09:04:37 GMT ak19 [33603]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/GeoLiteCity.dat (added)
	* gs3-extensions/maori-lang-detection/lib/geoip-api-1.2.10.jar (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java (modified)

	Incorporating Dr Nichols' suggestion to help weed out product sites: ...

Wed, 23 Oct 2019 10:49:34 GMT ak19 [33602]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MRIWebPageStats.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)

	1. The final csv file, mri-sentences.csv, is now written out. 2. Only ...

Wed, 23 Oct 2019 10:22:14 GMT ak19 [33601]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)

	Creates the 2nd csv file, with info about webpages. At present stores ...

Wed, 23 Oct 2019 10:05:38 GMT ak19 [33600]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MRIWebPageStats.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)

	Work in progress of writing out CSV files. In future, may write the ...

Tue, 22 Oct 2019 07:49:48 GMT ak19 [33599]
	* gs3-extensions/maori-lang-detection/crawled-1-of-3.tar.gz (added)

	First one-third sites crawled. Committing to SVN despite the tarred ...

Tue, 22 Oct 2019 07:19:54 GMT ak19 [33598]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz (modified)

	More instructions on setting up Nutch now that I've remembered to ...

Tue, 22 Oct 2019 07:05:50 GMT ak19 [33597]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE (modified)

	Committing active version of template file which has a newline at end ...

Tue, 22 Oct 2019 05:44:05 GMT ak19 [33596]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/nutch-site.xml (added)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE (added)

	Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template ...

Fri, 18 Oct 2019 10:20:09 GMT ak19 [33588]
	* gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts/mri-sent_trained.bin (modified)

	Committing the MRI sentence model that I'm actually using, the one in ...

Fri, 18 Oct 2019 10:16:25 GMT ak19 [33587]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MRIWebPageStats.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextLanguageDetector.java (modified)

	1. Better stats reporting on crawled sites: not just if a page was in ...

Fri, 18 Oct 2019 09:20:06 GMT ak19 [33586]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextLanguageDetector.java (added)

	Refactored MaoriTextDetector.java class into more general ...

Fri, 18 Oct 2019 08:41:32 GMT ak19 [33585]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)

	Much simpler way of using sentence and language detection model to ...

Fri, 18 Oct 2019 08:20:39 GMT ak19 [33584]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)

	Committing experimental version 2 using the sentence detector model, ...

Fri, 18 Oct 2019 08:20:18 GMT ak19 [33583]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)

	Committing experimental version 1 using the sentence detector model, ...

Thu, 17 Oct 2019 10:12:38 GMT ak19 [33582]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MRIWebPageStats.java (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)

	NutchTextDumpProcessor prints each crawled site's stats: number of ...

Thu, 17 Oct 2019 08:53:20 GMT ak19 [33581]
	* gs3-extensions/maori-lang-detection/bin/script/gen_SentenceDetection_model.sh (modified)

	Minor fix. Noticed when looking for work I did on MRI sentence detection

Thu, 17 Oct 2019 08:44:46 GMT ak19 [33580]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)

	Finally fixed the thus-far identified bugs when parsing dump.txt.

Thu, 17 Oct 2019 08:05:21 GMT ak19 [33579]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)

	Debugging. Solved one problem.

Thu, 17 Oct 2019 06:31:53 GMT ak19 [33578]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (modified)

	Corrections for compiling the 2 new classes.

Thu, 17 Oct 2019 06:12:15 GMT ak19 [33577]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (modified)

	Forgot to adjust usage statement to say that silent mode was already ...

Wed, 16 Oct 2019 10:37:41 GMT ak19 [33576]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java (added)

	Introducing 2 new Java files still being written and untested. ...

Wed, 16 Oct 2019 10:36:20 GMT ak19 [33575]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	Correcting usage string for CCWETProcessor before committing new java ...

Wed, 16 Oct 2019 10:35:45 GMT ak19 [33574]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)

	If nutch stores a crawled site in more than 1 file, then cat all of ...

Wed, 16 Oct 2019 08:39:56 GMT ak19 [33573]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified)

	Forgot to document that spaces were also allowed as separator in the ...

Wed, 16 Oct 2019 08:18:38 GMT ak19 [33572]
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102618-000000.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102621-000001.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000002.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000003.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103611-000004.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103613-000005.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000006.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000007.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104558-000009.warc.wet (deleted)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104559-000008.warc.wet (deleted)

	Only meant to store the wet.gz versions of these files, not also the ...

Wed, 16 Oct 2019 08:11:26 GMT ak19 [33571]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)

	Adding Dr Bainbridge's suggestion of appending the crawlId of each ...

Wed, 16 Oct 2019 07:04:44 GMT ak19 [33570]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)

	Need to check if UNFINISHED file actually exists before moving it ...

Wed, 16 Oct 2019 07:00:09 GMT ak19 [33569]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	1. batchcrawl.sh now does what it should have from the start, which ...

Mon, 14 Oct 2019 10:36:54 GMT ak19 [33568]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	1. More sites greylisted and blacklisted, discovered as I attempted ...

Mon, 14 Oct 2019 09:40:22 GMT ak19 [33567]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)

	batchcrawl.sh now supports -all flag (and prints usage on 0 args). ...

Mon, 14 Oct 2019 09:07:45 GMT ak19 [33566]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)

	batchcrawl.sh script now supports taking a comma or space separated ...

Mon, 14 Oct 2019 08:04:58 GMT ak19 [33565]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	CCWETProcessor: domain url now goes in as a seedURL after the ...

Mon, 14 Oct 2019 08:01:17 GMT ak19 [33564]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (modified)

	batchcrawl.sh now does the crawl and logs output of the crawl, dumps ...

Fri, 11 Oct 2019 10:29:40 GMT ak19 [33563]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (added)

	Committing inactive testing batch scripts (only creates the regex- ...

Fri, 11 Oct 2019 08:52:40 GMT ak19 [33562]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/lib/LICENSE.txt (added)
	* gs3-extensions/maori-lang-detection/lib/NOTICE.txt (added)
	* gs3-extensions/maori-lang-detection/lib/commons-csv-1.7.jar (added)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a ...

Fri, 11 Oct 2019 07:49:05 GMT ak19 [33561]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	1. sites-too-big-to-exhaustively-crawl.txt is now a comma separated ...

Thu, 10 Oct 2019 10:49:58 GMT ak19 [33560]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	1. Incorporated Dr Bainbridge's suggested improvements: only when ...

Thu, 10 Oct 2019 10:44:31 GMT ak19 [33559]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt (modified)

	1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge ...

Thu, 10 Oct 2019 10:41:36 GMT ak19 [33558]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Committing cumulative changes since last commit.

Wed, 09 Oct 2019 10:10:06 GMT ak19 [33557]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	Implemented the topSitesMap of topsite domain to url pattern in the ...

Wed, 09 Oct 2019 05:58:30 GMT ak19 [33556]
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)

	Blacklisted wikipedia pages that are actually in other languages ...

Wed, 09 Oct 2019 05:43:47 GMT ak19 [33555]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)

	Modified top sites list as Dr Bainbridge described: suffixes for the ...

Wed, 09 Oct 2019 05:11:19 GMT ak19 [33554]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)

	Added more to blacklist and greylist. And removed remaining ...

Fri, 04 Oct 2019 09:19:20 GMT ak19 [33553]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)

	Comments

Fri, 04 Oct 2019 09:00:46 GMT ak19 [33552]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified)

	1. Code now processes ccrawldata folder, containing each individual ...

Fri, 04 Oct 2019 06:35:06 GMT ak19 [33551]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)

	Added in top 500 urls from moz.com/top500 and removed duplicates, and ...

Fri, 04 Oct 2019 06:06:51 GMT ak19 [33550]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (added)
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)

	First stage of introducing sites-too-big-to-exhaustively-crawl.txt: ...

Fri, 04 Oct 2019 05:29:50 GMT ak19 [33549]
	* gs3-extensions/maori-lang-detection/ccrawl-data (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135334-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135335-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135533-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135534-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135731-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135732-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135930-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135930-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926140130-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926140132-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927111950-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927111952-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112247-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112247-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112539-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112540-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112830-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112832-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927113121-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927113122-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930134759-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930134801-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135217-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135218-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135634-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135637-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140053-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140056-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140510-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140512-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112358-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112358-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112629-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112631-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112900-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112901-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113130-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113131-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113401-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113401-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085129-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085129-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085435-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085437-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085739-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085740-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090041-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090044-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090347-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090348-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924031741-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924031742-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032031-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032034-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032319-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032319-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032606-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032607-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032851-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032854-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923212744-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923212748-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213222-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213227-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213659-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213702-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214137-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214138-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214614-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214616-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923161945-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923161945-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162223-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162223-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162500-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162502-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162737-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162739-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923163013-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923163015-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094332-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094332-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094842-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094845-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095357-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095358-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095911-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095912-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923100426-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923100427-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035248-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035249-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035802-000002.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035802-000003.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040326-000005.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040331-000004.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040848-000007.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040849-000006.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923041403-000008.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923041404-000009.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100141-000001.warc.wet.gz (added)
	* gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100451-000002.warc.wet.gz (added)
	* 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100453-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100805-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100809-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101119-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101119-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101429-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101429-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102618-000000.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102618-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102621-000001.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102621-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000002.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000003.warc.wet (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103611-000004.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103611-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103613-000005.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103613-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000006.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000007.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104558-000009.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104558-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104559-000008.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104559-000008.warc.wet.gz (added) All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 ... 
Fri, 04 Oct 2019 01:36:53 GMT davidb [33548] * gs3-extensions/web-audio/trunk/INSTALL.sh (modified) Include new wavesurfer sub-project to install Fri, 04 Oct 2019 01:19:51 GMT davidb [33546] * gs3-extensions/web-audio/trunk/wavesurfer (added) * gs3-extensions/web-audio/trunk/wavesurfer/INSTALL.sh (added) * gs3-extensions/web-audio/trunk/wavesurfer/css (added) * gs3-extensions/web-audio/trunk/wavesurfer/css/ribbon.css (added) * gs3-extensions/web-audio/trunk/wavesurfer/css/style.css (added) * gs3-extensions/web-audio/trunk/wavesurfer/devel (added) * gs3-extensions/web-audio/trunk/wavesurfer/devel/node-v10.16.3-darwin-x64.tar.gz (added) * gs3-extensions/web-audio/trunk/wavesurfer/src (added) * gs3-extensions/web-audio/trunk/wavesurfer/src/wavesurfer.js-2.2.1.tar.gz (added) * gs3-extensions/web-audio/trunk/wavesurfer/wavesurfer-player.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.min.js (added) * 
gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.js (added) * 
gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.min.js.map (added) Initial cut at wave-surfer based JS audio player extension for Greenstone Thu, 03 Oct 2019 09:38:00 GMT ak19 [33545] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified) Mainly changes to crawling-Nutch.txt and some minor changes to other ... Wed, 02 Oct 2019 04:01:47 GMT ak19 [33543] * gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz (modified) Filled in some missing instructions Tue, 01 Oct 2019 09:27:03 GMT ak19 [33541] * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/patches/GZRangeClient.java (added) * gs3-extensions/maori-lang-detection/hdfs-cc-work/patches/WATExtractorOutput.java (added) 1. hdfs-cc-work/GS_README.txt now contains the complete instructions ... Tue, 01 Oct 2019 08:40:33 GMT ak19 [33540] * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) Since I wasn't getting further with nutch 2 to grab an entire site, I ... 
Tue, 01 Oct 2019 08:36:38 GMT ak19 [33539] * gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (moved) File rename Tue, 01 Oct 2019 08:36:06 GMT ak19 [33538] * gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/setup.sh (modified) Some additions to the setup.sh script to query commoncrawl for MRI ... Mon, 30 Sep 2019 09:51:36 GMT ak19 [33537] * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) More nutch and general site mirroring related links Mon, 30 Sep 2019 08:28:38 GMT ak19 [33536] * gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz (added) Changes required to the commoncrawl related Vagrant github project to ... Mon, 30 Sep 2019 03:49:19 GMT ak19 [33535] * gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/setup.sh (added) 1. New setup.sh script for use on a hadoop system to set up the git ... Fri, 27 Sep 2019 05:05:40 GMT ak19 [33534] * gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh (modified) Correction: toplevel script has to be placed inside cc-index-table ... Thu, 26 Sep 2019 11:06:11 GMT ak19 [33532] * gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified) Found the other top 500 sites link again at last which Dr Bainbridge ... Thu, 26 Sep 2019 11:03:01 GMT ak19 [33531] * gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified) * gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified) * gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt (added) Added whitelist for mi.wikipedia.org, and updates to blacklist and ... Thu, 26 Sep 2019 10:41:56 GMT ak19 [33530] * gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt (modified) Completed sentence that was left hanging. 
Thu, 26 Sep 2019 10:22:07 GMT ak19 [33529] * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) Forgot to add most basic nutch links Thu, 26 Sep 2019 09:47:13 GMT ak19 [33528] * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (moved) Adding in Nutch links Thu, 26 Sep 2019 08:39:38 GMT ak19 [33527] * gs3-extensions/maori-lang-detection/hdfs-cc-work (moved) Name change for folder Thu, 26 Sep 2019 08:38:14 GMT ak19 [33526] * gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh (deleted) * gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh (deleted) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_Maori_WET_records_from_CCSep2018_on.sh (modified) Moved hadoop related scripts from bin/script into hdfs-instructions Thu, 26 Sep 2019 08:35:38 GMT ak19 [33525] * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_Maori_WET_records_from_CCSep2018_on.sh (moved) Rename before latest version Thu, 26 Sep 2019 08:34:12 GMT ak19 [33524] * gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-instructions/conf (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/conf/ia-hadoop-tools-pom.xml (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/conf/spark-defaults.conf.in (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects/cc-index-table.tar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects/ia-hadoop-tools.tar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects/ia-web-commons.tar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/aws-java-sdk-1.11.616.jar (added) * 
gs3-extensions/maori-lang-detection/hdfs-instructions/jars/aws-java-sdk-1.7.4.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/guava.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/hadoop-aws-2.7.6.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/patches (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/patches/CCIndexWarcExport.java (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/patches/CCIndexWarcExport.java.orig (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/GS_README (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/export_maori_index_csv.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/export_maori_subset.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/export_maori_subset_from_scratch.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_Maori_WET_records_in_cc_from_Sep2018.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_maori_WET_records_for_crawl.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/limit10_export_index.sh (added) 1. Further adjustments to documenting what we did to get things to ... Thu, 26 Sep 2019 07:00:36 GMT ak19 [33523] * gs3-extensions/maori-lang-detection/bin/script/gen-all-dumps.sh (modified) Instructional comment Thu, 26 Sep 2019 07:00:23 GMT ak19 [33522] * gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh (modified) Some comments and an improvement Tue, 24 Sep 2019 09:40:16 GMT ak19 [33519] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) Code still writes out the global seedURLs.txt and regex-urlfilter.txt ... 
Tue, 24 Sep 2019 09:13:47 GMT ak19 [33518] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) Intermediate commit: got the seed urls file temporarily written out ... Tue, 24 Sep 2019 08:30:40 GMT ak19 [33517] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) 1. Blacklists were introduced so that too many instances of ... Tue, 24 Sep 2019 08:14:16 GMT ak19 [33516] * gs3-extensions/maori-lang-detection/bin/script/gen-all-dumps.sh (added) Before I accidentally lose it, committing the script Dr Bainbridge ... Tue, 24 Sep 2019 07:50:40 GMT ak19 [33515] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) Removed an unused function Tue, 24 Sep 2019 07:44:04 GMT ak19 [33514] * gs3-extensions/maori-lang-detection/hdfs-instructions (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt (added) Committing README on starting off with the vagrant VM for hadoop- ... Tue, 24 Sep 2019 07:15:01 GMT ak19 [33513] * gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh (added) Higher level script that runs against each named crawl since Sep 2018 ... Mon, 23 Sep 2019 11:16:28 GMT ak19 [33503] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) More efficient blacklisting/greylisting/whitelisting now by reading ...