#
# ChangeLog for gs3-extensions/maori-lang-detection/MoreReading
#
# Generated by Trac 1.4.2
# 2024-05-24T00:25:05+12:00

Mon, 14 Oct 2019 08:04:58 GMT ak19 [33565]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	CCWETProcessor: domain url now goes in as a seedURL after the ...

Thu, 10 Oct 2019 10:41:36 GMT ak19 [33558]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Committing cumulative changes since last commit.

Thu, 03 Oct 2019 09:38:00 GMT ak19 [33545]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)

	Mainly changes to crawling-Nutch.txt and some minor changes to other ...

Tue, 01 Oct 2019 09:27:03 GMT ak19 [33541]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/patches/GZRangeClient.java (added)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/patches/WATExtractorOutput.java (added)

	1. hdfs-cc-work/GS_README.txt now contains the complete instructions ...

Tue, 01 Oct 2019 08:40:33 GMT ak19 [33540]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Since I wasn't getting further with nutch 2 to grab an entire site, I ...

Mon, 30 Sep 2019 09:51:36 GMT ak19 [33537]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	More nutch and general site mirroring related links

Thu, 26 Sep 2019 10:22:07 GMT ak19 [33529]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Forgot to add most basic nutch links

Thu, 26 Sep 2019 09:47:13 GMT ak19 [33528]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (moved)

	Adding in Nutch links

Mon, 23 Sep 2019 05:59:07 GMT ak19 [33499]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	Explicitly adding in IAM policy configuration details instead of just ...

Sun, 22 Sep 2019 07:23:28 GMT ak19 [33496]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	Minor changes to reading list file

Fri, 13 Sep 2019 05:44:41 GMT ak19 [33467]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/config.properties (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java (modified)
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified)

	Improved the code to use a static block to load the needed properties ...

Thu, 05 Sep 2019 07:01:36 GMT ak19 [33457]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	Got stage 1, the WARC to WET conversion, working, after necessary ...

Thu, 05 Sep 2019 05:26:27 GMT ak19 [33456]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	Link to discussion on how to convert WARC to WET

Fri, 30 Aug 2019 06:27:21 GMT ak19 [33448]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	Minor clarification and inclusion of helpful command

Thu, 29 Aug 2019 07:12:39 GMT ak19 [33446]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)
	* gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_subset.sh (added)
	* gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_subset_from_scratch.sh (added)

	1. Committing working version of export_maori_subset.sh which takes ...

Wed, 28 Aug 2019 08:22:34 GMT ak19 [33443]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	More notes

Wed, 28 Aug 2019 07:30:00 GMT ak19 [33441]
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified)

	Adding further notes to do with running the CC-index examples on spark.

Wed, 28 Aug 2019 07:17:42 GMT ak19 [33440]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
	* gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (added)

	Split file to move vagrant-spark-hadoop notes into own file.

Mon, 19 Aug 2019 08:31:23 GMT ak19 [33428]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	Working commoncrawl cc-warc-examples' WET wordcount example using ...

Fri, 16 Aug 2019 10:15:40 GMT ak19 [33425]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	A few more links now that I got past getting the vagrant VM with ...

Thu, 15 Aug 2019 08:07:04 GMT ak19 [33423]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	Adding in the link to the vagrant VM with Hadoop, Spark for cluster ...

Thu, 15 Aug 2019 05:52:19 GMT ak19 [33422]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	Some more links.

Thu, 15 Aug 2019 04:20:03 GMT ak19 [33419]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	Last evening, I had found some links about how language-detection is ...

Tue, 13 Aug 2019 09:57:58 GMT ak19 [33414]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	Adding important links

Tue, 13 Aug 2019 03:59:29 GMT ak19 [33409]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
	* gs3-extensions/maori-lang-detection/MoreReading/WebScraping.txt (added)
	* gs3-extensions/maori-lang-detection/MoreReading/macrons_with_emacs.txt (added)
	* gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified)

	Forgot to commit 2 files with links and shuffling some links around ...

Tue, 13 Aug 2019 03:09:28 GMT ak19 [33408]
	* gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified)

	Some rough notes. Will move into appropriate file later.

Mon, 12 Aug 2019 08:35:48 GMT ak19 [33404]
	* gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified)

	1. Links to other Java ways of extracting text from web content. 2. ...

Fri, 09 Aug 2019 06:57:12 GMT ak19 [33393]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)
	* gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified)

	Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls ...

Wed, 07 Aug 2019 07:11:12 GMT ak19 [33391]
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified)

	Some rough bash scripting lines that work but aren't complete.

Wed, 31 Jul 2019 06:39:24 GMT ak19 [33376]
	* gs3-extensions/maori-lang-detection/MoreReading (added)
	* gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (added)
	* gs3-extensions/maori-lang-detection/MoreReading/Heritrix-and-WCT.txt (added)
	* gs3-extensions/maori-lang-detection/MoreReading/other.txt (added)

	Links and extracts I've read so far on the Web Curator Tool (WCT), ...