Changeset 33598

Timestamp:
22.10.2019 20:19:54
Author:
ak19
Message:

More instructions on setting up Nutch now that I've remembered to commit the prepared conf files. I've also added the instructions into the top-level GS_README here, since it was a pain untarring the vagrant-for-nutch2 tarball each time just to read the instructions I included in it.

Location:
gs3-extensions/maori-lang-detection/hdfs-cc-work
Files:
2 modified

Legend:

Unmodified (no prefix)
+ Added
- Removed
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

r33545 → r33598

H. Austici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps
I. Setting up Nutch v2 on its own Vagrant VM
+ J. Automated crawling with Nutch v2.3.1 and post-processing

----------------------------------------

----------------------------------------------------
1. Untar vagrant-for-nutch2.tar.gz
- 2. Follow the instructions in vagrant-for-nutch2/GS_README.txt
+ 2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt

---

(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the Apache Nutch 2 installation guide.)

[All lines from here to the end of the diff were added in r33598.]

---
Vagrant VM for Nutch2
---
This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark

However:
- It comes with older versions of Hadoop (2.5.2) and HBase (0.98.21), and no Spark, Hive or other packages.
- The VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101).
- Since not all packages are installed, fewer ports need forwarding, and each is forwarded to its port number + 2 so as not to conflict with any vagrant VM that uses the original vagrant image's forwarded port numbers.
- scripts/common.sh uses an HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21).
- HBase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link created there needed to refer to a path of this form. A sketch of both follows this list.
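
A minimal sketch of what the affected lines in scripts/common.sh and scripts/setup-hbase.sh end up looking like; only HBASE_ARCHIVE and the versioned install path are named above, so the surrounding variable usage and the symlink name /usr/local/hbase are assumptions:

   # scripts/common.sh: the archive name carries the -hadoop2 suffix specific to v0.98.21
   HBASE_VERSION=0.98.21
   HBASE_ARCHIVE="hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz"

   # scripts/setup-hbase.sh: the symlink must point at the -hadoop2 install path
   ln -sf /usr/local/hbase-${HBASE_VERSION}-hadoop2 /usr/local/hbase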

INSTRUCTIONS:
a. Mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
b. But after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with the modified counterparts included in this tarball.
c. Wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, use "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile instead.
If a node2 VM, or one with IP "10.211.55.102", is already set up, then adjust all the files in the git-cloned vagrant VM that were already modified by the contents of this vagrant-for-nutch2 folder, as follows (a sketch follows this list):
- increment all occurrences of node2 and "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
- in the Vagrantfile, increment the forwarded ports by another 2 or so beyond the highest port numbers already in use by other vagrant VMs.
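
A minimal sketch of how that renaming could be scripted, assuming the modified files are the Vagrantfile plus the contents of the scripts and resources folders (run from inside the cloned vagrant folder; the forwarded ports in the Vagrantfile still need bumping by hand):

   # bump node2 -> node3 and .102 -> .103 in place, skipping any subdirectories
   for f in Vagrantfile scripts/* resources/*; do
      [ -f "$f" ] && sed -i 's/node2/node3/g; s/10\.211\.55\.102/10.211.55.103/g' "$f"
   done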
d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 in the "Getting Started" section.
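
For reference, the two step d commands as they'd be run from the folder containing the customised Vagrantfile, for the node2 VM:

   vagrant up --provider=virtualbox
   vagrant ssh node2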
e. Inside the VM, install emacs, maven and firefox:

   sudo apt-get install emacs

   sudo apt update
   sudo apt install maven

   sudo apt-get -y install firefox

f. We set up Nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/. That version worked, as per the Nutch 2 tutorial instructions, with the specific versions of Hadoop, HBase and Gora configured for the vagrant VM described here.

After untarring the Nutch 2.3.1 source tarball (steps 1 and 2 are sketched as commands after this list):
  1. Move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
  2. Download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf and put them into the apache-nutch-2.3.1/conf folder.
  3. Then continue following the Nutch 2 tutorial instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up Nutch2 (and apache-solr if needed, but I didn't install apache-solr for Nutch v2).
     - nutch-site.xml has already been configured with as much optimisation and speeding up of the crawling as we know about for Nutch.
     - for each site to be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, with lines of regex filters appended to its end.
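
A minimal sketch of steps 1 and 2 above, assuming Trac serves the raw files when "?format=txt" is appended to the browser URLs, and assuming the source tarball is named apache-nutch-2.3.1-src.tar.gz:

   tar xzf apache-nutch-2.3.1-src.tar.gz    # assumed tarball name
   cd apache-nutch-2.3.1/conf
   mv regex-urlfilter.txt regex-urlfilter.txt.orig
   wget "http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/nutch-site.xml?format=txt" -O nutch-site.xml
   wget "http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE?format=txt" -O regex-urlfilter.GS_TEMPLATE
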
------------------------------------------------------------------------
J. Automated crawling with Nutch v2.3.1 and post-processing
------------------------------------------------------------------------
1. When you're ready to start crawling with Nutch 2.3.1:
- copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level, and make the script executable (a sketch follows this list);
- copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by CCWETProcessor.java running over the downloaded common-crawl data where MRI was the primary language) into the vagrant machine at top level;
- run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
    ./batchcrawl.sh 00485-00500
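
A minimal sketch of that setup from inside the VM, assuming the VM has direct internet access, that Trac serves the raw script when "?format=txt" is appended, and that to_crawl.tar.gz has already been copied into the home directory:

   cd ~
   wget "http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh?format=txt" -O batchcrawl.sh
   chmod u+x batchcrawl.sh
   tar xzf to_crawl.tar.gz    # unpack the "to_crawl" folder next to the script
   ./batchcrawl.sh 00485-00500
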
2. When crawling is done, the above will have generated a "crawled" folder containing a subfolder for each crawled site, e.g. subfolders 00485 to 00500. Each crawled site's subfolder will contain a dump.txt with the text output of that site's web pages. The "crawled" folder, with its per-site subfolders each containing a dump.txt file, can then be processed with NutchTextDumpProcessor.java. A quick sanity check is sketched below.
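
A minimal sanity check, assuming the layout just described (one dump.txt per crawled-site subfolder), to spot sites whose crawl produced no text before handing the folder to NutchTextDumpProcessor.java:

   # -s tests that dump.txt exists and is non-empty
   for site in crawled/*/; do
      [ -s "${site}dump.txt" ] || echo "Missing or empty dump.txt in $site"
   done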