Changeset 33543
- Timestamp: 2019-10-02T17:01:47+13:00
- Location: gs3-extensions/maori-lang-detection/hdfs-cc-work
- Files: 2 edited
gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT
r33541 r33543
    15     15   ---
    16     16   H. Austici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler taps
           17 + I. Setting up Nutch v2 on its own Vagrant VM machine
    17     18
    18     19   ----------------------------------------
     …      …
   558    559   (default location and filename unless you pass flags to crawl CLI to control these)
   559    560
   560        - a. Ensure you get crawl.warc.gz onto the vagrant VM with the WARC to WET git projectsinstalled, recompiled with the above modifications.
          561 + a. Ensure you get crawl.warc.gz onto the vagrant VM that has the git projects for WARC-to-WET installed, recompiled with the above modifications.
   561    562
   562    563   b. Now, create the folder structure needed for warc-to-wet conversion:
     …      …
   575    576
   576    577   More meaningful when the WARC_FOLDER contains multiple *.warc.gz files,
   577        - as the above will use map-reduce to generate the *.warc.wet.gz files in the output wet folder.
          578 + as the above will use map-reduce to generate a *.warc.wet.gz file in the output wet folder for each input *.warc.gz file.
   578    579
   579    580   e. Copy the generated wet files across from /user/vagrant/warctest/wet/:
     …      …
   595    596
   596    597
          598 + ----------------------------------------------------
          599 + I. Setting up Nutch v2 on its own Vagrant VM machine
          600 + ----------------------------------------------------
          601 + 1. Untar vagrant-for-nutch2.tar.gz
          602 + 2. Follow the instructions in vagrant-for-nutch2/GS_README.txt
          603 +
          604 + ---
          605 + REASONING FOR THE NUTCH v2 SPECIFIC VAGRANT VM:
          606 + ---
          607 + We were able to get nutch v1 working on a regular machine.
          608 +
          609 + From a few pages online starting with https://stackoverflow.com/questions/33354460/nutch-clone-website, it appeared that "./bin/nutch fetch -all" was the nutch command to mirror a web site. Nutch v2 introduced the -all flag to the ./bin/nutch fetch command. And nutch v2 required HBase which presupposes hadoop.
          610 +
          611 + Our vagrant VM for commoncrawl had an incompatible version of HBase but this version was needed for that VM's version of hadoop and spark. So Dr Bainbridge came up with the idea of having a separate Vagrant VM for Nutch v2 which would have the version of HBase it needed and a Hadoop version matching that. Compatible versions with nutch 2.3.1 are mentioned at https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
          612 + (Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)
          613 +
          614 +
          615 +
   597    616   -----------------------EOF------------------------
   598    617