Changeset 33598 for gs3-extensions


Timestamp: 2019-10-22T20:19:54+13:00
Author: ak19
Message:

More instructions on setting up Nutch now that I've remembered to commit the prepared conf files. I've also added the instructions into the top-level GS_README here, since it was a pain untarring the vagrant-for-nutch2 tarball each time just to read the instructions I included in it.

Location: gs3-extensions/maori-lang-detection/hdfs-cc-work
Files: 2 edited

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33545    r33598
   16   16  H. Austici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
   17   17  I. Setting up Nutch v2 on its own Vagrant VM
        18  J. Automated crawling with Nutch v2.3.1 and post-processing
   18   19  
   19   20  ----------------------------------------
     
  601  602  ----------------------------------------------------
  602  603  1. Untar vagrant-for-nutch2.tar.gz
  603       2. Follow the instructions in vagrant-for-nutch2/GS_README.txt
       604  2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt
  604  605  
  605  606  ---
     
  613  614  (Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the Apache Nutch 2 installation guide.)
  614  615  
       616  ---
       617  Vagrant VM for Nutch2
       618  ---
       619  This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark
       620  
       621  However:
       622  - It comes with the older versions of Hadoop 2.5.2 and HBase 0.98.21, and no Spark, Hive or other packages.
       623  - The VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101).
       624  - Since not all packages are installed, fewer ports needed forwarding. They are forwarded to portnumber+2 so as not to conflict with any vagrant VM that still uses the original vagrant image's forwarded port numbers.
       625  - scripts/common.sh uses an HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21).
       626  - HBase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link created there needed to refer to a path of this form.
       627  
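The last two points above amount to threading the -hadoop2 suffix through the archive name and install path. A minimal sketch in shell; the variable names and echo lines are illustrative assumptions, not verbatim contents of the modified scripts:

```shell
#!/bin/sh
# Hypothetical sketch of the -hadoop2 adjustments described above; the real
# scripts/common.sh and scripts/setup-hbase.sh may name things differently.
HBASE_VERSION=0.98.21

# common.sh: the archive name carries the -hadoop2 suffix specific to v0.98.21
HBASE_ARCHIVE="hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz"

# setup-hbase.sh: the untarred install dir also carries the suffix, so the
# symlink (e.g. /usr/local/hbase) must point at the -hadoop2 path
HBASE_HOME="/usr/local/hbase-${HBASE_VERSION}-hadoop2"

echo "$HBASE_ARCHIVE"
echo "$HBASE_HOME"
```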
       628  INSTRUCTIONS:
       629  a. Mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
       630  b. But after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with their modified counterparts included in this zip file.
       631  c. Wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, instead use "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile.
       632  If there's already a vagrant VM set up as node2 or with IP "10.211.55.102", then adjust all the files in the git-cloned vagrant VM already modified by the contents of this vagrant-for-nutch2 folder as follows:
       633  - increment all occurrences of node2 and IP "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
       634  - in the Vagrantfile, increment the forwarded ports by another 2 or so from the highest port numbers already in use by other vagrant VMs.
       635  d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 of the "Getting Started" section.
       636  e. Inside the VM, install emacs, maven, firefox:
       637  
       638     sudo apt-get install emacs
       639  
       640     sudo apt update
       641     sudo apt install maven
       642  
       643     sudo apt-get -y install firefox
       644  
       645  f. We set up Nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/. That version worked, following the Nutch 2 tutorial instructions, with the specific versions of Hadoop, HBase and Gora configured for the vagrant VM described here.
       646  
       647  After untarring the Nutch 2.3.1 source tarball:
       648    1. Move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
       649    2. Download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf
       650       and put them into the apache-nutch-2.3.1/conf folder.
       651    3. Then continue following the Nutch 2 tutorial instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up Nutch 2 (and Apache Solr if needed, but I didn't install Apache Solr for Nutch v2).
       652       - nutch-site.xml has already been configured with as much optimisation and speeding up of the crawling as we know about for Nutch.
       653       - For each site to be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, with that site's regex filter lines appended to its end.
       654  
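Steps 1 and 2 of the conf setup can be sketched in shell. The "?format=txt" raw-download form of the trac URLs is an assumption on my part (the text gives the /browser viewer URL); downloading the two files by hand works just as well:

```shell
#!/bin/sh
# Sketch of conf steps 1 and 2 above; run from where the tarball was untarred.
cd apache-nutch-2.3.1/conf

# 1. keep the stock filter file around as a backup
mv regex-urlfilter.txt regex-urlfilter.txt.orig

# 2. fetch the prepared conf files into this folder
#    (?format=txt raw-download URL form is assumed, not given in the text)
base="http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf"
wget -O nutch-site.xml "${base}/nutch-site.xml?format=txt"
wget -O regex-urlfilter.GS_TEMPLATE "${base}/regex-urlfilter.GS_TEMPLATE?format=txt"
```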
       655  ------------------------------------------------------------------------
       656  J. Automated crawling with Nutch v2.3.1 and post-processing
       657  ------------------------------------------------------------------------
       658  1. When you're ready to start crawling with Nutch 2.3.1:
       659  - Copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level. Make the script executable.
       660  - Copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by running CCWETProcessor.java over the common-crawl downloaded data where MRI was the primary language) into the vagrant machine at top level.
       661  - Run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
       662      ./batchcrawl.sh 00485-00500
       663  
       664  2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled-site folder will contain a dump.txt with the text output of that site's web pages. The "crawled" folder, with site subfolders each containing a dump.txt file, can then be processed with NutchTextDumpProcessor.java.
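As a quick sanity check before handing "crawled" over to NutchTextDumpProcessor.java, a small shell loop (my own addition, not part of the committed scripts; folder layout as described above) can report any site folder that is missing its dump.txt:

```shell
#!/bin/sh
# Report any crawled-site folder that lacks the expected dump.txt.
# Assumes the layout described above: crawled/00485/ ... crawled/00500/
for site in crawled/*/; do
    if [ -f "${site}dump.txt" ]; then
        echo "OK: ${site}"
    else
        echo "MISSING: ${site}dump.txt"
    fi
done
```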
       665  
  615  666  
  616  667  