source: gs3-extensions/maori-lang-detection/MoreReading

Change Rev Age Author Log Message
(edit) @33565   5 years ak19 CCWETProcessor: domain url now goes in as a seedURL after the …
(edit) @33558   5 years ak19 Committing cumulative changes since last commit.
(edit) @33545   5 years ak19 Mainly changes to crawling-Nutch.txt and some minor changes to other …
(edit) @33541   5 years ak19 1. hdfs-cc-work/GS_README.txt now contains the complete instructions …
(edit) @33540   5 years ak19 Since I wasn't getting further with nutch 2 to grab an entire site, I …
(edit) @33537   5 years ak19 More nutch and general site mirroring related links
(edit) @33529   5 years ak19 Forgot to add most basic nutch links
(edit) @33528   5 years ak19 Adding in Nutch links
(edit) @33499   5 years ak19 Explicitly adding in IAM policy configuration details instead of just …
(edit) @33496   5 years ak19 Minor changes to reading list file
(edit) @33467   5 years ak19 Improved the code to use a static block to load the needed properties …
(edit) @33457   5 years ak19 Got stage 1, the WARC to WET conversion, working, after necessary …
(edit) @33456   5 years ak19 Link to discussion on how to convert WARC to WET
(edit) @33448   5 years ak19 Minor clarification and inclusion of helpful command
(edit) @33446   5 years ak19 1. Committing working version of export_maori_subset.sh which takes …
(edit) @33443   5 years ak19 More notes
(edit) @33441   5 years ak19 Adding further notes to do with running the CC-index examples on spark.
(edit) @33440   5 years ak19 Split file to move vagrant-spark-hadoop notes into own file.
(edit) @33428   5 years ak19 Working commoncrawl cc-warc-examples' WET wordcount example using …
(edit) @33425   5 years ak19 A few more links now that I got past getting the vagrant VM with spark …
(edit) @33423   5 years ak19 Adding in the link to the vagrant VM with Hadoop, Spark for cluster …
(edit) @33422   5 years ak19 Some more links.
(edit) @33419   5 years ak19 Last evening, I had found some links about how language-detection is …
(edit) @33414   5 years ak19 Adding important links
(edit) @33409   5 years ak19 Forgot to commit 2 files with links and shuffling some links around …
(edit) @33408   5 years ak19 Some rough notes. Will move into appropriate file later.
(edit) @33404   5 years ak19 1. Links to other Java ways of extracting text from web content. 2. …
(edit) @33393   5 years ak19 Modified the get_commoncrawl_nz_urls.sh to also create a reduced urls …
(edit) @33391   5 years ak19 Some rough bash scripting lines that work but aren't complete.
(add) @33376   5 years ak19 Links and extracts I've read so far on the Web Curator Tool (WCT), …