#
# ChangeLog for gs3-extensions/maori-lang-detection
#
# Generated by Trac 1.4.2
# 2024-05-23T16:57:58+12:00

Thu, 12 Sep 2019 08:00:14 GMT ak19 [33465]
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (added)

    Committing first version of the WETProcessor.java which takes a ...

Thu, 05 Sep 2019 07:01:36 GMT ak19 [33457]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

    Got stage 1, the WARC to WET conversion, working, after necessary ...

Thu, 05 Sep 2019 05:26:27 GMT ak19 [33456]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

    Link to discussion on how to convert WARC to WET

Fri, 30 Aug 2019 06:27:21 GMT ak19 [33448]
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

    Minor clarification and inclusion of helpful command

Thu, 29 Aug 2019 07:12:39 GMT ak19 [33446]
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)
    * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_subset.sh (added)
    * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_subset_from_scratch.sh (added)

    1. Committing working version of export_maori_subset.sh which takes ...

Thu, 29 Aug 2019 05:01:12 GMT ak19 [33445]
    * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts (added)
    * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_index_csv.sh (added)

    The first working hadoop spark script for processing common crawl ...

Wed, 28 Aug 2019 08:22:34 GMT ak19 [33443]
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

    More notes

Wed, 28 Aug 2019 07:30:38 GMT ak19 [33442]
    * gs3-extensions/maori-lang-detection/lib/gutil.jar (modified)

    Updated gutil.jar file (with SafeProcses debugging)

Wed, 28 Aug 2019 07:30:00 GMT ak19 [33441]
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

    Adding further notes to do with running the CC-index examples on spark.

Wed, 28 Aug 2019 07:17:42 GMT ak19 [33440]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
    * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (added)

    Split file to move vagrant-spark-hadoop notes into own file.

Mon, 19 Aug 2019 08:31:23 GMT ak19 [33428]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    Working commoncrawl cc-warc-examples' WET wordcount example using ...

Fri, 16 Aug 2019 10:15:40 GMT ak19 [33425]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    A few more links now that I got past getting the vagrant VM with ...

Thu, 15 Aug 2019 08:07:04 GMT ak19 [33423]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    Adding in the link to the vagrant VM with Hadoop, Spark for cluster ...

Thu, 15 Aug 2019 05:52:19 GMT ak19 [33422]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    Some more links.

Thu, 15 Aug 2019 04:20:03 GMT ak19 [33419]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    Last evening, I had found some links about how language-detection is ...

Tue, 13 Aug 2019 09:57:58 GMT ak19 [33414]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    Adding important links

Tue, 13 Aug 2019 09:57:42 GMT ak19 [33413]
    * gs3-extensions/maori-lang-detection/bin/script/create-uniq-WET-urls-file.sh (added)
    * gs3-extensions/maori-lang-detection/bin/script/create-uniq-nz-urls-file.sh (added)
    * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified)

    Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, ...

Tue, 13 Aug 2019 09:54:31 GMT ak19 [33412]
    * gs3-extensions/maori-lang-detection/conf/config.properties (modified)

    config command for wgetting a single file

Tue, 13 Aug 2019 09:50:29 GMT ak19 [33411]
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified)

    Newer version now doesn't mirror sites with wget but gets WET files ...

Tue, 13 Aug 2019 09:48:19 GMT ak19 [33410]
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified)

    Committing some variable name changes before I replace this file with ...

Tue, 13 Aug 2019 03:59:29 GMT ak19 [33409]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
    * gs3-extensions/maori-lang-detection/MoreReading/WebScraping.txt (added)
    * gs3-extensions/maori-lang-detection/MoreReading/macrons_with_emacs.txt (added)
    * gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified)

    Forgot to commit 2 files with links and shuffling some links around ...

Tue, 13 Aug 2019 03:09:28 GMT ak19 [33408]
    * gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified)

    Some rough notes. Will move into appropriate file later.

Tue, 13 Aug 2019 02:40:50 GMT ak19 [33407]
    * gs3-extensions/maori-lang-detection/lib/gutil.jar (modified)

    gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting ...

Mon, 12 Aug 2019 08:37:44 GMT ak19 [33405]
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified)

    Even though we're probably not going to use this code after all, will ...

Mon, 12 Aug 2019 08:35:48 GMT ak19 [33404]
    * gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified)

    1. Links to other Java ways of extracting text from web content. 2. ...

Sun, 11 Aug 2019 10:03:14 GMT ak19 [33402]
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (added)

    Beginnings of the Java class to wget sites and process its pages to ...

Sun, 11 Aug 2019 09:16:41 GMT ak19 [33401]
    * gs3-extensions/maori-lang-detection/logs (added)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.class (deleted)

    MaoriTextDetector.class file now generated inside its package folder ...

Sun, 11 Aug 2019 09:15:26 GMT ak19 [33400]
    * gs3-extensions/maori-lang-detection/conf/log4j.properties (added)
    * gs3-extensions/maori-lang-detection/conf/log4j.properties.in (added)
    * gs3-extensions/maori-lang-detection/lib/log4j-1.2.8.jar (added)

    1. Setting up log4j.properties based on the macronizer's basic one ...

Sun, 11 Aug 2019 08:48:54 GMT ak19 [33399]
    * gs3-extensions/maori-lang-detection/conf (added)
    * gs3-extensions/maori-lang-detection/conf/config.properties (moved)
    * gs3-extensions/maori-lang-detection/lib/gutil.jar (added)

    Putting properties files into the conf folder and keeping the lib ...

Sun, 11 Aug 2019 07:35:57 GMT ak19 [33398]
    * gs3-extensions/maori-lang-detection/README.txt (modified)
    * gs3-extensions/maori-lang-detection/src/org (added)
    * gs3-extensions/maori-lang-detection/src/org/greenstone (added)
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea (added)
    * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (moved)

    Committing the actual package structure and the updated README after ...

Sun, 11 Aug 2019 07:30:49 GMT ak19 [33397]
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (modified)

    1. Changing package structure and instructions on compiling/running ...

Fri, 09 Aug 2019 08:37:23 GMT ak19 [33394]
    * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified)
    * gs3-extensions/maori-lang-detection/feasibility.txt (added)
    * gs3-extensions/maori-lang-detection/lib (added)
    * gs3-extensions/maori-lang-detection/lib/config.properties (added)

    1. Started a file on feasibility with the data now available and some ...

Fri, 09 Aug 2019 06:57:12 GMT ak19 [33393]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
    * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified)

    Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls ...

Wed, 07 Aug 2019 07:11:12 GMT ak19 [33391]
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

    Some rough bash scripting lines that work but aren't complete.

Wed, 07 Aug 2019 05:31:10 GMT ak19 [33390]
    * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified)

    Minor message telling the user to wait for a task that takes some time.

Wed, 31 Jul 2019 09:09:31 GMT ak19 [33379]
    * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (added)

    New script to automate getting a file listing of the common crawl URL ...

Wed, 31 Jul 2019 07:05:15 GMT ak19 [33378]
    * gs3-extensions/maori-lang-detection/bin (added)
    * gs3-extensions/maori-lang-detection/bin/script (added)
    * gs3-extensions/maori-lang-detection/bin/script/gen_SentenceDetection_model.sh (moved)

    New bin/script folder and relocating gen_SentenceDetection_model.sh ...

Wed, 31 Jul 2019 07:04:00 GMT ak19 [33377]
    * gs3-extensions/maori-lang-detection/README.txt (modified)
    * gs3-extensions/maori-lang-detection/gen_SentenceDetection_model.sh (modified)

    Changes to get gen_SentenceDetection_model.sh to run still from the ...

Wed, 31 Jul 2019 06:39:24 GMT ak19 [33376]
    * gs3-extensions/maori-lang-detection/MoreReading (added)
    * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (added)
    * gs3-extensions/maori-lang-detection/MoreReading/Heritrix-and-WCT.txt (added)
    * gs3-extensions/maori-lang-detection/MoreReading/other.txt (added)

    Links and extracts I've read so far on the Web Curator Tool (WCT), ...

Wed, 24 Jul 2019 09:03:29 GMT ak19 [33358]
    * gs3-extensions/maori-lang-detection/README.txt (modified)

    More minor changes to README

Wed, 24 Jul 2019 09:00:47 GMT ak19 [33357]
    * gs3-extensions/maori-lang-detection/README.txt (modified)
    * gs3-extensions/maori-lang-detection/gen_SentenceDetection_model.sh (modified)

    Minor changes

Wed, 24 Jul 2019 08:57:39 GMT ak19 [33356]
    * gs3-extensions/maori-lang-detection/gen_SentenceDetection_model.sh (modified)

    Updating script. Correction to a filepath different in the svn folder ...

Wed, 24 Jul 2019 08:54:50 GMT ak19 [33355]
    * gs3-extensions/maori-lang-detection/README.txt (modified)
    * gs3-extensions/maori-lang-detection/gen_SentenceDetection_model.sh (added)
    * gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts (added)
    * gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts/langdetect-183.bin (moved)
    * gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts/mri-sent.train (added)
    * gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts/mri-sent_trained.bin (added)
    * gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts/sample_maori_shorttext.txt (added)
    * gs3-extensions/maori-lang-detection/models-trainingdata-and-sampletxts/sample_mri_paragraphs.txt (added)
    * gs3-extensions/maori-lang-detection/mri-opennlp-corpus.tar.gz (added)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.class (modified)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (modified)

    Changes for adding in the new gen_SentenceDetection_model.sh script, ...

Tue, 23 Jul 2019 05:29:18 GMT ak19 [33350]
    * gs3-extensions/maori-lang-detection/README.txt (modified)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.class (modified)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (modified)

    Better comments. Tested macronised vs unmacronised Māori language ...

Sat, 20 Jul 2019 11:43:53 GMT ak19 [33339]
    * gs3-extensions/maori-lang-detection/README.txt (modified)

    Updated README.

Sat, 20 Jul 2019 11:24:46 GMT ak19 [33338]
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.class (modified)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (modified)

    1.After renaming the java class, changed all occurrences of the old ...

Sat, 20 Jul 2019 11:21:41 GMT ak19 [33337]
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.class (moved)
    * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (moved)

    Renaming the class to MaoriTextDetector, since it doesn't detect ...

Sat, 20 Jul 2019 10:58:17 GMT ak19 [33336]
    * gs3-extensions/maori-lang-detection/src/MaoriDetector.class (modified)
    * gs3-extensions/maori-lang-detection/src/MaoriDetector.java (modified)

    Major rewrite to make this class more useful to callers. ...

Fri, 19 Jul 2019 10:17:21 GMT ak19 [33335]
    * gs3-extensions/maori-lang-detection (added)
    * gs3-extensions/maori-lang-detection/README.txt (added)
    * gs3-extensions/maori-lang-detection/apache-opennlp-1.9.1-bin.tar.gz (added)
    * gs3-extensions/maori-lang-detection/langdetect-183.bin (added)
    * gs3-extensions/maori-lang-detection/src (added)
    * gs3-extensions/maori-lang-detection/src/MaoriDetector.class (added)
    * gs3-extensions/maori-lang-detection/src/MaoriDetector.java (added)

    First java file for Māori language detection using openNLP with the ...