#
# ChangeLog for gs3-extensions
#
# Generated by Trac 1.4.2
# 2024-04-19T14:48:59+12:00

Thu, 10 Oct 2019 10:49:58 GMT ak19 [33560]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	1. Incorporated Dr Bainbridge's suggested improvements: only when ...

Thu, 10 Oct 2019 10:44:31 GMT ak19 [33559]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt (modified)

	1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge ...

Thu, 10 Oct 2019 10:41:36 GMT ak19 [33558]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Committing cumulative changes since last commit.

Wed, 09 Oct 2019 10:10:06 GMT ak19 [33557]
	* gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified)

	Implemented the topSitesMap of topsite domain to url pattern in the ...

Wed, 09 Oct 2019 05:58:30 GMT ak19 [33556]
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)

	Blacklisted wikipedia pages that are actually in other languages ...

Wed, 09 Oct 2019 05:43:47 GMT ak19 [33555]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)

	Modified top sites list as Dr Bainbridge described: suffixes for the ...

Wed, 09 Oct 2019 05:11:19 GMT ak19 [33554]
	* gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)

	Added more to blacklist and greylist. And removed remaining ...

Fri, 04 Oct 2019 09:19:20 GMT ak19 [33553] * gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified) Comments Fri, 04 Oct 2019 09:00:46 GMT ak19 [33552] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) 1. Code now processes ccrawldata folder, containing each individual ... Fri, 04 Oct 2019 06:35:06 GMT ak19 [33551] * gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified) Added in top 500 urls from moz.com/top500 and removed duplicates, and ... Fri, 04 Oct 2019 06:06:51 GMT ak19 [33550] * gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (added) * gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified) First stage of introducing sites-too-big-to-exhaustively-crawl.tx: ... Fri, 04 Oct 2019 05:29:50 GMT ak19 [33549] * gs3-extensions/maori-lang-detection/ccrawl-data (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135334-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135335-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135533-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135534-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135731-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135732-000005.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135930-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926135930-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926140130-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-39-wet-files/MAORI-CC-MAIN-2018-39-20190926140132-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927111950-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927111952-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112247-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112247-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112539-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112540-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112830-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927112832-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927113121-000009.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-43-wet-files/MAORI-CC-MAIN-2018-43-20190927113122-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930134759-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930134801-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135217-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135218-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135634-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930135637-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140053-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140056-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140510-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-47-wet-files/MAORI-CC-MAIN-2018-47-20190930140512-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112358-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112358-000001.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112629-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112631-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112900-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002112901-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113130-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113131-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113401-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2018-51-wet-files/MAORI-CC-MAIN-2018-51-20191002113401-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085129-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085129-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085435-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085437-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085739-000005.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924085740-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090041-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090044-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090347-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-04-wet-files/MAORI-CC-MAIN-2019-04-20190924090348-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924031741-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924031742-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032031-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032034-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032319-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032319-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032606-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032607-000007.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032851-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-09-wet-files/MAORI-CC-MAIN-2019-09-20190924032854-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923212744-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923212748-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213222-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213227-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213659-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923213702-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214137-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214138-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214614-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-13-wet-files/MAORI-CC-MAIN-2019-13-20190923214616-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923161945-000000.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923161945-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162223-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162223-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162500-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162502-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162737-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923162739-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923163013-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-18-wet-files/MAORI-CC-MAIN-2019-18-20190923163015-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094332-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094332-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094842-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923094845-000002.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095357-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095358-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095911-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923095912-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923100426-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-22-wet-files/MAORI-CC-MAIN-2019-22-20190923100427-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035248-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035249-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035802-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923035802-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040326-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040331-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040848-000007.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923040849-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923041403-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-26-wet-files/MAORI-CC-MAIN-2019-26-20190923041404-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100139-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100141-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100451-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100453-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100805-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902100809-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101119-000006.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101119-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101429-000008.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-30-wet-files/MAORI-CC-2019-30-20190902101429-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102618-000000.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102618-000000.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102621-000001.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921102621-000001.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000002.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000002.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000003.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103116-000003.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103611-000004.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103611-000004.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103613-000005.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921103613-000005.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000006.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000006.warc.wet.gz (added) * 
gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000007.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104105-000007.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104558-000009.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104558-000009.warc.wet.gz (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104559-000008.warc.wet (added) * gs3-extensions/maori-lang-detection/ccrawl-data/CC-MAIN-2019-35-wet-files/MAORI-CC-MAIN-2019-35-20190921104559-000008.warc.wet.gz (added) All the downloaded commoncrawl MRI warc.wet.gz data from Sep 2018 ... Fri, 04 Oct 2019 01:36:53 GMT davidb [33548] * gs3-extensions/web-audio/trunk/INSTALL.sh (modified) Include new wavesurfer sub-project to install Fri, 04 Oct 2019 01:19:51 GMT davidb [33546] * gs3-extensions/web-audio/trunk/wavesurfer (added) * gs3-extensions/web-audio/trunk/wavesurfer/INSTALL.sh (added) * gs3-extensions/web-audio/trunk/wavesurfer/css (added) * gs3-extensions/web-audio/trunk/wavesurfer/css/ribbon.css (added) * gs3-extensions/web-audio/trunk/wavesurfer/css/style.css (added) * gs3-extensions/web-audio/trunk/wavesurfer/devel (added) * gs3-extensions/web-audio/trunk/wavesurfer/devel/node-v10.16.3-darwin-x64.tar.gz (added) * gs3-extensions/web-audio/trunk/wavesurfer/src (added) * gs3-extensions/web-audio/trunk/wavesurfer/src/wavesurfer.js-2.2.1.tar.gz (added) * gs3-extensions/web-audio/trunk/wavesurfer/wavesurfer-player.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.js (added) * 
gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.cursor.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.elan.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.mediasession.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.microphone.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.minimap.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.min.js (added) * 
gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.regions.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.spectrogram.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/plugin/wavesurfer.timeline.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer-html-init.min.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.js.map (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.min.js (added) * gs3-extensions/web-audio/trunk/wavesurfer/ws/wavesurfer.min.js.map (added) Initial cut at wave-surfer based JS audio player extension for Greenstone Thu, 03 Oct 2019 09:38:00 GMT ak19 [33545] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified) Mainly changes to crawling-Nutch.txt and some minor changes to other ... 
Wed, 02 Oct 2019 04:01:47 GMT ak19 [33543]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz (modified)

	Filled in some missing instructions

Tue, 01 Oct 2019 09:27:03 GMT ak19 [33541]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/patches/GZRangeClient.java (added)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/patches/WATExtractorOutput.java (added)

	1. hdfs-cc-work/GS_README.txt now contains the complete instructions ...

Tue, 01 Oct 2019 08:40:33 GMT ak19 [33540]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Since I wasn't getting further with nutch 2 to grab an entire site, I ...

Tue, 01 Oct 2019 08:36:38 GMT ak19 [33539]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (moved)

	File rename

Tue, 01 Oct 2019 08:36:06 GMT ak19 [33538]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/setup.sh (modified)

	Some additions to the setup.sh script to query commoncrawl for MRI ...

Mon, 30 Sep 2019 09:51:36 GMT ak19 [33537]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	More nutch and general site mirroring related links

Mon, 30 Sep 2019 08:28:38 GMT ak19 [33536]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz (added)

	Changes required to the commoncrawl related Vagrant github project to ...

Mon, 30 Sep 2019 03:49:19 GMT ak19 [33535]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt (modified)
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/setup.sh (added)

	1. New setup.sh script for on a hadoop system to setup the git ...

Fri, 27 Sep 2019 05:05:40 GMT ak19 [33534]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/get_Maori_WET_records_from_CCSep2018_on.sh (modified)

	Correction: toplevel script has to be placed inside cc-index-table ...

Thu, 26 Sep 2019 11:06:11 GMT ak19 [33532]
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)

	Found the other top 500 sites link again at last which Dr Bainbridge ...

Thu, 26 Sep 2019 11:03:01 GMT ak19 [33531]
	* gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (modified)
	* gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt (added)

	Added whitelist for mi.wikipedia.org, and updates to blacklist and ...

Thu, 26 Sep 2019 10:41:56 GMT ak19 [33530]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work/Readme.txt (modified)

	Completed sentence that was left hanging.

Thu, 26 Sep 2019 10:22:07 GMT ak19 [33529]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (modified)

	Forgot to add most basic nutch links

Thu, 26 Sep 2019 09:47:13 GMT ak19 [33528]
	* gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (moved)

	Adding in Nutch links

Thu, 26 Sep 2019 08:39:38 GMT ak19 [33527]
	* gs3-extensions/maori-lang-detection/hdfs-cc-work (moved)

	Name change for folder

Thu, 26 Sep 2019 08:38:14 GMT ak19 [33526]
	* gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh (deleted)
	* gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh (deleted)
	* gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_Maori_WET_records_from_CCSep2018_on.sh (modified)

	Moved hadoop related scripts from bin/script into hdfs-instructions

Thu, 26 Sep 2019 08:35:38 GMT ak19 [33525]
	* gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_Maori_WET_records_from_CCSep2018_on.sh (moved)

	Rename before latest version

Thu, 26 Sep 2019 08:34:12 GMT ak19
[33524] * gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt (modified) * gs3-extensions/maori-lang-detection/hdfs-instructions/conf (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/conf/ia-hadoop-tools-pom.xml (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/conf/spark-defaults.conf.in (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects/cc-index-table.tar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects/ia-hadoop-tools.tar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/gitprojects/ia-web-commons.tar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/aws-java-sdk-1.11.616.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/aws-java-sdk-1.7.4.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/guava.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/jars/hadoop-aws-2.7.6.jar (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/patches (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/patches/CCIndexWarcExport.java (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/patches/CCIndexWarcExport.java.orig (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/GS_README (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/export_maori_index_csv.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/export_maori_subset.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/export_maori_subset_from_scratch.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_Maori_WET_records_in_cc_from_Sep2018.sh (added) * 
gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/get_maori_WET_records_for_crawl.sh (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/scripts/limit10_export_index.sh (added) 1. Further adjustments to documenting what we did to get things to ... Thu, 26 Sep 2019 07:00:36 GMT ak19 [33523] * gs3-extensions/maori-lang-detection/bin/script/gen-all-dumps.sh (modified) Instructional comment Thu, 26 Sep 2019 07:00:23 GMT ak19 [33522] * gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh (modified) Some comments and an improvement Tue, 24 Sep 2019 09:40:16 GMT ak19 [33519] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) Code still writes out the global seedURLs.txt and regex-urlfilter.txt ... Tue, 24 Sep 2019 09:13:47 GMT ak19 [33518] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) Intermediate commit: got the seed urls file temporarily written out ... Tue, 24 Sep 2019 08:30:40 GMT ak19 [33517] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) 1. Blacklists were introduced so that too many instances of ... Tue, 24 Sep 2019 08:14:16 GMT ak19 [33516] * gs3-extensions/maori-lang-detection/bin/script/gen-all-dumps.sh (added) Before I accidentally lose it, committing the script Dr Bainbridge ... Tue, 24 Sep 2019 07:50:40 GMT ak19 [33515] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) Removed an unused function Tue, 24 Sep 2019 07:44:04 GMT ak19 [33514] * gs3-extensions/maori-lang-detection/hdfs-instructions (added) * gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt (added) Committing README on starting off with the vagrant VM for hadoop- ... 
Tue, 24 Sep 2019 07:15:01 GMT ak19 [33513] * gs3-extensions/maori-lang-detection/bin/script/get_Maori_WET_records_from_CCSep2018_on.sh (added) Higher level script that runs against each named crawl since Sep 2018 ... Mon, 23 Sep 2019 11:16:28 GMT ak19 [33503] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) More efficient blacklisting/greylisting/whitelisting now by reading ... Mon, 23 Sep 2019 11:11:29 GMT ak19 [33502] * gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt (added) * gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt (added) Current url pattern blacklist and greylist filter files. Used by ... Mon, 23 Sep 2019 09:28:06 GMT ak19 [33501] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (added) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) Refactored code into 2 classes: The existing WETProcessor, which ... Mon, 23 Sep 2019 05:59:07 GMT ak19 [33499] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) Explicitly adding in IAM policy configuration details instead of just ... Mon, 23 Sep 2019 04:43:22 GMT ak19 [33498] * gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh (modified) Corrections to script. Modified the tests checking for file/dir ... Sun, 22 Sep 2019 09:17:48 GMT ak19 [33497] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) First version of discard url filter file. Inefficient implementation. ... 
Sun, 22 Sep 2019 07:23:28 GMT ak19 [33496] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) Minor changes to reading list file Sun, 22 Sep 2019 07:19:36 GMT ak19 [33495] * gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh (modified) Pruned out unused commands, added comments, marked unused variables ... Sat, 21 Sep 2019 10:49:56 GMT ak19 [33494] * gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh (added) All in one script that takes as parameter a common crawl identifier ... Wed, 18 Sep 2019 08:20:09 GMT ak19 [33489] * gs3-extensions/maori-lang-detection/bin/script/drop_nutch_solrcore.sh (added) Handy file to not have to keep manually repeating commands when ... Tue, 17 Sep 2019 02:48:36 GMT ak19 [33488] * gs3-extensions/maori-lang-detection/bin/script/unique_mri_domains_from_cc.sh (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) new function createSeedURLsFiles() in WETProcessor that replaces the ... Mon, 16 Sep 2019 07:45:01 GMT ak19 [33480] * gs3-extensions/maori-lang-detection/conf/config.properties (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) Much harder to remove pages where words are fused together as some ... Fri, 13 Sep 2019 10:57:38 GMT ak19 [33471] * gs3-extensions/maori-lang-detection/bin/script/unique_mri_domains_from_cc.sh (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) Very minor changes. Fri, 13 Sep 2019 10:53:23 GMT ak19 [33470] * gs3-extensions/maori-lang-detection/bin/script/unique_mri_domains_from_cc.sh (added) A new script to reduce keepURLs.txt to unique URLs, 1 from each ... Fri, 13 Sep 2019 09:46:09 GMT ak19 [33469] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) Don't want URLs with the word product(s) in them (but production ... 
Fri, 13 Sep 2019 07:24:27 GMT ak19 [33468] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) More meaningful to (also) write out the keep vs discard URLs into ... Fri, 13 Sep 2019 05:44:41 GMT ak19 [33467] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) * gs3-extensions/maori-lang-detection/conf/config.properties (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) Improved the code to use a static block to load the needed properties ... Thu, 12 Sep 2019 09:37:39 GMT ak19 [33466] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java (added) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (modified) 1. WETProcessor.main() now processes a folder of *.warc.wet(.gz) ... Thu, 12 Sep 2019 08:00:14 GMT ak19 [33465] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/WETProcessor.java (added) Committing first version of the WETProcessor.java which takes a ... Thu, 05 Sep 2019 07:01:36 GMT ak19 [33457] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) Got stage 1, the WARC to WET conversion, working, after necessary ... 
Thu, 05 Sep 2019 05:26:27 GMT ak19 [33456] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) Link to discussion on how to convert WARC to WET Fri, 30 Aug 2019 06:27:21 GMT ak19 [33448] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) Minor clarification and inclusion of helpful command Thu, 29 Aug 2019 07:12:39 GMT ak19 [33446] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_subset.sh (added) * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_subset_from_scratch.sh (added) 1. Committing working version of export_maori_subset.sh which takes ... Thu, 29 Aug 2019 05:01:12 GMT ak19 [33445] * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts (added) * gs3-extensions/maori-lang-detection/bin/hadoop-spark-scripts/export_maori_index_csv.sh (added) The first working hadoop spark script for processing common crawl ... Wed, 28 Aug 2019 08:22:34 GMT ak19 [33443] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) More notes Wed, 28 Aug 2019 07:30:38 GMT ak19 [33442] * gs3-extensions/maori-lang-detection/lib/gutil.jar (modified) Updated gutil.jar file (with SafeProcess debugging) Wed, 28 Aug 2019 07:30:00 GMT ak19 [33441] * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) Adding further notes to do with running the CC-index examples on spark. Wed, 28 Aug 2019 07:17:42 GMT ak19 [33440] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (added) Split file to move vagrant-spark-hadoop notes into own file.
Mon, 19 Aug 2019 08:31:23 GMT ak19 [33428] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) Working commoncrawl cc-warc-examples' WET wordcount example using ... Fri, 16 Aug 2019 10:15:40 GMT ak19 [33425] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) A few more links now that I got past getting the vagrant VM with ... Thu, 15 Aug 2019 08:07:04 GMT ak19 [33423] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) Adding in the link to the vagrant VM with Hadoop, Spark for cluster ... Thu, 15 Aug 2019 05:52:19 GMT ak19 [33422] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) Some more links. Thu, 15 Aug 2019 04:20:03 GMT ak19 [33419] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) Last evening, I had found some links about how language-detection is ... Tue, 13 Aug 2019 09:57:58 GMT ak19 [33414] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) Adding important links Tue, 13 Aug 2019 09:57:42 GMT ak19 [33413] * gs3-extensions/maori-lang-detection/bin/script/create-uniq-WET-urls-file.sh (added) * gs3-extensions/maori-lang-detection/bin/script/create-uniq-nz-urls-file.sh (added) * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified) Splitting the get_commoncrawl_nz_urls.sh script back into 2 scripts, ... Tue, 13 Aug 2019 09:54:31 GMT ak19 [33412] * gs3-extensions/maori-lang-detection/conf/config.properties (modified) config command for wgetting a single file Tue, 13 Aug 2019 09:50:29 GMT ak19 [33411] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified) Newer version now doesn't mirror sites with wget but gets WET files ... Tue, 13 Aug 2019 09:48:19 GMT ak19 [33410] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified) Committing some variable name changes before I replace this file with ... 
Tue, 13 Aug 2019 03:59:29 GMT ak19 [33409] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) * gs3-extensions/maori-lang-detection/MoreReading/WebScraping.txt (added) * gs3-extensions/maori-lang-detection/MoreReading/macrons_with_emacs.txt (added) * gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified) Forgot to commit 2 files with links and shuffling some links around ... Tue, 13 Aug 2019 03:09:28 GMT ak19 [33408] * gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified) Some rough notes. Will move into appropriate file later. Tue, 13 Aug 2019 02:40:50 GMT ak19 [33407] * gs3-extensions/maori-lang-detection/lib/gutil.jar (modified) gutil.jar was rebuilt yesterday in GS3 after a bugfix. Recommitting ... Mon, 12 Aug 2019 08:37:44 GMT ak19 [33405] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (modified) Even though we're probably not going to use this code after all, will ... Mon, 12 Aug 2019 08:35:48 GMT ak19 [33404] * gs3-extensions/maori-lang-detection/MoreReading/other.txt (modified) 1. Links to other Java ways of extracting text from web content. 2. ... Sun, 11 Aug 2019 10:03:14 GMT ak19 [33402] * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NZTLDProcessor.java (added) Beginnings of the Java class to wget sites and process its pages to ... Sun, 11 Aug 2019 09:16:41 GMT ak19 [33401] * gs3-extensions/maori-lang-detection/logs (added) * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.class (deleted) MaoriTextDetector.class file now generated inside its package folder ... Sun, 11 Aug 2019 09:15:26 GMT ak19 [33400] * gs3-extensions/maori-lang-detection/conf/log4j.properties (added) * gs3-extensions/maori-lang-detection/conf/log4j.properties.in (added) * gs3-extensions/maori-lang-detection/lib/log4j-1.2.8.jar (added) 1. Setting up log4j.properties based on the macronizer's basic one ... 
Sun, 11 Aug 2019 08:48:54 GMT ak19 [33399] * gs3-extensions/maori-lang-detection/conf (added) * gs3-extensions/maori-lang-detection/conf/config.properties (moved) * gs3-extensions/maori-lang-detection/lib/gutil.jar (added) Putting properties files into the conf folder and keeping the lib ... Sun, 11 Aug 2019 07:35:57 GMT ak19 [33398] * gs3-extensions/maori-lang-detection/README.txt (modified) * gs3-extensions/maori-lang-detection/src/org (added) * gs3-extensions/maori-lang-detection/src/org/greenstone (added) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea (added) * gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MaoriTextDetector.java (moved) Committing the actual package structure and the updated README after ... Sun, 11 Aug 2019 07:30:49 GMT ak19 [33397] * gs3-extensions/maori-lang-detection/src/MaoriTextDetector.java (modified) 1. Changing package structure and instructions on compiling/running ... Sun, 11 Aug 2019 06:20:14 GMT ak19 [33396] * gs3-extensions/solr/trunk/src/collect/solr-jdbm-demo/resources/collectionConfig_ka.properties (modified) * main/trunk/greenstone3/web/sites/localsite/collect/lucene-jdbm-demo/resources/collectionConfig_es.properties (modified) * main/trunk/greenstone3/web/sites/localsite/collect/lucene-jdbm-demo/resources/collectionConfig_fr.properties (modified) * main/trunk/greenstone3/web/sites/localsite/collect/lucene-jdbm-demo/resources/collectionConfig_gu.properties (modified) * main/trunk/greenstone3/web/sites/localsite/collect/lucene-jdbm-demo/resources/collectionConfig_ja.properties (modified) * main/trunk/greenstone3/web/sites/localsite/collect/lucene-jdbm-demo/resources/collectionConfig_ka.properties (modified) * main/trunk/greenstone3/web/sites/localsite/collect/lucene-jdbm-demo/resources/collectionConfig_pl.properties (modified) * main/trunk/greenstone3/web/sites/localsite/resources/siteConfig_ka.properties (modified) Georgian language gs3colcfg module of GS interface. Many thanks to ... 
Fri, 09 Aug 2019 08:37:23 GMT ak19 [33394] * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified) * gs3-extensions/maori-lang-detection/feasibility.txt (added) * gs3-extensions/maori-lang-detection/lib (added) * gs3-extensions/maori-lang-detection/lib/config.properties (added) 1. Started a file on feasibility with the data now available and some ... Fri, 09 Aug 2019 06:57:12 GMT ak19 [33393] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified) Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls ... Thu, 08 Aug 2019 03:15:11 GMT ak19 [33392] * gs3-extensions/solr/trunk/src/perllib/solrbuilder.pm (modified) * gs3-extensions/solr/trunk/src/perllib/solrserver.pm (modified) * main/trunk/greenstone2/bin/script/activate.pl (modified) Kathy found a problem whereby she wanted to run consecutive buildcols ... Wed, 07 Aug 2019 07:11:12 GMT ak19 [33391] * gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) Some rough bash scripting lines that work but aren't complete. Wed, 07 Aug 2019 05:31:10 GMT ak19 [33390] * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (modified) Minor message telling the user to wait for a task that takes some time. Mon, 05 Aug 2019 23:46:09 GMT kjdon [33388] * gs3-extensions/solr/trunk/src/src/java/org/greenstone/gsdl3/util/SolrQueryWrapper.java (modified) tidied up some debug statements Wed, 31 Jul 2019 09:09:31 GMT ak19 [33379] * gs3-extensions/maori-lang-detection/bin/script/get_commoncrawl_nz_urls.sh (added) New script to automate getting a file listing of the common crawl URL ... 
Wed, 31 Jul 2019 07:05:15 GMT ak19 [33378] * gs3-extensions/maori-lang-detection/bin (added) * gs3-extensions/maori-lang-detection/bin/script (added) * gs3-extensions/maori-lang-detection/bin/script/gen_SentenceDetection_model.sh (moved) New bin/script folder and relocating gen_SentenceDetection_model.sh ...