Changeset 33543


Timestamp: 2019-10-02T17:01:47+13:00
Author: ak19
Message: Filled in some missing instructions
Location: gs3-extensions/maori-lang-detection/hdfs-cc-work
Files: 2 edited

--- gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (r33541)
+++ gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT (r33543)
@@ -15,4 +15,5 @@
 ---
 H. Austici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler taps
+I. Setting up Nutch v2 on its own Vagrant VM machine
 
 ----------------------------------------
@@ -558,5 +559,5 @@
 (default location and filename unless you pass flags to crawl CLI to control these)
 
-a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projects installed, recompiled with the above modifications.
+a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.
 
 b. Now, create the folder structure needed for warc-to-wet conversion:
@@ -575,5 +576,5 @@
 
 More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
-as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.
+as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.
 
 e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
@@ -595,4 +596,22 @@
 
 
+----------------------------------------------------
+I. Setting up Nutch v2 on its own Vagrant VM machine
+----------------------------------------------------
+1. Untar vagrant-for-nutch2.tar.gz
+2. Follow the instructions in vagrant-for-nutch2/GS_README.txt
+
+---
+REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
+---
+We were able to get nutch v1 working on a regular machine.
+
+From a few pages online starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the ./bin/nutch fetch command. And nutch v2 required HBase which presupposes hadoop.
+
+Our vagrant VM for commoncrawl had an incompatible version of HBase but this version was needed for that VM's version of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2 which would have the version of HBase it needed and a Hadoop version matching that. Compatible versions with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
+(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)
+
+
+
 -----------------------EOF------------------------
 
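The reworded instruction in this changeset states that the map-reduce job writes one *.warc.wet.gz file per input *.warc.gz file. A minimal sketch of how that one-to-one relationship could be checked once both folders have been copied to the local filesystem is given below. It is not part of the changeset. The function names are hypothetical, and the naming rule (x.warc.gz becomes x.warc.wet.gz) is an assumption inferred from the file patterns in the README, not something the README states.

```python
from pathlib import Path
from typing import List


def expected_wet_name(warc_name: str) -> str:
    """Map 'x.warc.gz' to 'x.warc.wet.gz' (assumed naming convention)."""
    suffix = ".warc.gz"
    if not warc_name.endswith(suffix):
        raise ValueError(f"not a .warc.gz file: {warc_name}")
    return warc_name[: -len(suffix)] + ".warc.wet.gz"


def missing_wet_files(warc_dir: str, wet_dir: str) -> List[str]:
    """Return the expected WET file names that are absent from wet_dir."""
    # Names of the WET files that actually exist in the output folder.
    present = {p.name for p in Path(wet_dir).glob("*.warc.wet.gz")}
    # One expected WET name per input WARC, in sorted order.
    expected = [
        expected_wet_name(p.name)
        for p in sorted(Path(warc_dir).glob("*.warc.gz"))
    ]
    return [name for name in expected if name not in present]
```

Calling missing_wet_files with the local WARC folder and the local copy of the wet folder would return an empty list if every input produced an output under the assumed naming rule.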