Changeset 33543

Timestamp:

2019-10-02T17:01:47+13:00

Author:

ak19

Message:

Filled in some missing instructions

Location:

gs3-extensions/maori-lang-detection/hdfs-cc-work

Files:

2 edited

GS_README.TXT (modified) (4 diffs)
vagrant-for-nutch2.tar.gz (modified)

gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

-              r33541
+              r33543
 ---
 H. Autistici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
+I. Setting up Nutch v2 on its own Vagrant VM machine
 ----------------------------------------
 …
 (default location and filename unless you pass flags to crawl CLI to control these)
-a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projects installed, recompiled with the above modifications.
+a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.
 b. Now, create the folder structure needed for warc-to-wet conversion:
 …
 More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
-as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.
+as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.
 e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
 …
+----------------------------------------------------
+I. Setting up Nutch v2 on its own Vagrant VM machine
+----------------------------------------------------
+. Untar vagrant-for-nutch2.tar.gz
+. Follow the instructions in vagrant-for-nutch2/GS_README.txt
+---
+REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
+---
+We were able to get nutch v1 working on a regular machine.
+From a few pages online starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the ./bin/nutch fetch command. And nutch v2 required HBase which presupposes hadoop.
+Our vagrant VM for commoncrawl had an incompatible version of HBase but this version was needed for that VM's version of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2 which would have the version of HBase it needed and a Hadoop version matching that. Compatible versions with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
+(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)
 -----------------------EOF------------------------
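
The two steps added under section I can be sketched as a short shell script. This is only an illustration, not part of the changeset: it assumes the tarball sits in the current directory and unpacks into a vagrant-for-nutch2/ folder (implied by the README path in the diff), and the function name unpack_nutch_vm is made up for this example.

```shell
#!/bin/sh
# Sketch only. Assumptions: the tarball is in the current directory and
# unpacks into ./vagrant-for-nutch2/ containing GS_README.txt.
ARCHIVE="vagrant-for-nutch2.tar.gz"
README="vagrant-for-nutch2/GS_README.txt"

# Hypothetical helper: extract the archive, then check that the README
# referred to in the instructions is actually present.
unpack_nutch_vm() {
    tar -xzf "$1" || return 1
    [ -f "$README" ] || { echo "missing $README" >&2; return 1; }
    echo "Next: follow the instructions in $README"
}

# Only attempt the extraction when the archive is actually there.
if [ -f "$ARCHIVE" ]; then
    unpack_nutch_vm "$ARCHIVE"
else
    echo "$ARCHIVE not found in $(pwd)" >&2
fi
```

Everything after the extraction (bringing the VM up, installing Nutch v2 and HBase) is left to the bundled GS_README.txt, since the changeset does not spell those steps out.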
