Changeset 33543

Timestamp: 02.10.2019 17:01:47
Author: ak19
Message: Filled in some missing instructions
Location: gs3-extensions/maori-lang-detection/hdfs-cc-work
Files: 2 modified

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    --- GS_README.TXT (r33541)
    +++ GS_README.TXT (r33543)
    @@ -15,4 +15,5 @@
     ---
     H. Autistici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
    +I. Setting up Nutch v2 on its own Vagrant VM machine
     
     ----------------------------------------
     
    @@ -558,5 +559,5 @@
     (default location and filename unless you pass flags to crawl CLI to control these)
     
    -a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projects installed, recompiled with the above modifications.
    +a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.
     
     b. Now, create the folder structure needed for warc-to-wet conversion:
     
    @@ -575,5 +576,5 @@
     
     More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
    -as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.
    +as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.
     
     e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
     
    @@ -595,4 +596,22 @@
     
     
    +----------------------------------------------------
    +I. Setting up Nutch v2 on its own Vagrant VM machine
    +----------------------------------------------------
    +1. Untar vagrant-for-nutch2.tar.gz
    +2. Follow the instructions in vagrant-for-nutch2/GS_README.txt
    +
    +---
    +REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
    +---
    +We were able to get nutch v1 working on a regular machine.
    +
    +From a few pages online starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the ./bin/nutch fetch command. And nutch v2 required HBase which presupposes hadoop.
    +
    +Our vagrant VM for commoncrawl had an incompatible version of HBase but this version was needed for that VM's version of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2 which would have the version of HBase it needed and a Hadoop version matching that. Compatible versions with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
    +(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)
    +
    +
    +
     -----------------------EOF------------------------
     
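
Steps 1 and 2 of the new section I can be sketched roughly as below. This is not part of the committed README: the function name unpack_nutch2_vm is made up for illustration, and the sketch assumes the tarball is in the current directory (or passed as the first argument) and unpacks into a top-level vagrant-for-nutch2/ folder containing GS_README.txt, as the instructions imply.

```shell
# Hypothetical helper (not from GS_README.TXT): unpack the Nutch v2 Vagrant
# bundle and point at the README that holds the actual setup instructions.
unpack_nutch2_vm() {
    # Assumption: archive name as given in step 1, in the current directory
    # unless another path is passed as the first argument.
    archive="${1:-vagrant-for-nutch2.tar.gz}"
    # Assumption: the archive unpacks into ./vagrant-for-nutch2/
    readme="vagrant-for-nutch2/GS_README.txt"

    if [ ! -f "$archive" ]; then
        echo "Cannot find $archive" >&2
        return 1
    fi

    # Step 1: untar (x = extract, z = gunzip, f = archive file)
    tar -xzf "$archive" || return 1

    # Step 2: the remaining setup lives in the bundled README
    if [ -f "$readme" ]; then
        echo "Now follow the instructions in $readme"
    else
        echo "Unpacked $archive but did not find $readme" >&2
        return 1
    fi
}
```

Called with no arguments, it looks for vagrant-for-nutch2.tar.gz in the current directory; a different path can be given, e.g. unpack_nutch2_vm /some/dir/vagrant-for-nutch2.tar.gz. Everything after the untar (bringing the VM up, installing HBase and Nutch v2) is left to the bundled GS_README.txt.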