Changeset 33598

22.10.2019 20:19:54 (3 weeks ago)

More instructions on setting up Nutch now that I've remembered to commit the prepared conf files. I've also added the instructions into the top-level GS_README here, since it was a pain untarring the vagrant-for-nutch2 tarball each time just to read the instructions I included in it.

2 modified


  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33545 r33598  
H. Austici's crawl - CLI to download web sites as WARCs, with basic features to avoid crawler traps
I. Setting up Nutch v2 on its own Vagrant VM
J. Automated crawling with Nutch v2.3.1 and post-processing
1. Untar vagrant-for-nutch2.tar.gz
2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt
(Another option was MongoDB instead of HBase, but that was not even covered in the apache nutch 2 installation guide.)
Vagrant VM for Nutch2

This vagrant virtual machine is based on
- It comes with the older versions of hadoop 2.5.2 and hbase 0.98.21, and no spark, hive or other packages.
- The VM is called node2 with IP (instead of node1 with IP
- Since not all packages are installed, fewer ports needed forwarding, and they're forwarded to port number + 2 so as not to conflict with any vagrant VM that used the original vagrant image's forwarded port numbers.
- scripts/ uses HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21)
- hbase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the -hadoop2 suffix specific to v0.98.21) in scripts/, so the symbolic link creation there needed to refer to a path of this form.
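The naming convention above can be sketched in shell. HBASE_VERSION and the -hadoop2 install path are taken from this README; the /usr/local/hbase symlink name is an assumption, since the actual script names are elided here:

```shell
# Sketch of the archive and install-path naming the provisioning scripts rely on.
# The -hadoop2 suffix is specific to HBase v0.98.21.
HBASE_VERSION=0.98.21
HBASE_ARCHIVE="hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz"
HBASE_HOME="/usr/local/hbase-${HBASE_VERSION}-hadoop2"

echo "$HBASE_ARCHIVE"
echo "$HBASE_HOME"

# The symlink creation must then point at the -hadoop2 path, e.g.
# (hypothetical link name):
#   ln -s "$HBASE_HOME" /usr/local/hbase
```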
a. Mostly follow the "Getting Started" instructions at
b. But after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with their modified counterparts included in this tarball.
c. Wherever the rest of that git page refers to "node1", IP "" and specific port numbers, use instead "node2", IP "" and the forwarded port numbers in the customised Vagrantfile.
If there's already a node2 / IP "" vagrant VM set up, then adjust all the files in the git-cloned vagrant folder (already modified by the contents of this vagrant-for-nutch2 folder) as follows:
- increment all occurrences of node2 and "" to node3 and IP "", if not already taken, and
- in the Vagrantfile, increment the forwarded ports by another 2 or so beyond the highest port numbers already in use by other vagrant VMs.
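The substitution part of that adjustment can be sketched with sed (assuming GNU sed; the demo file below is hypothetical, since the real Vagrantfile contents aren't shown in this README - in practice run the same command over the Vagrantfile, scripts and resources files, and repeat it for the IP address):

```shell
# Demo file standing in for the cloned Vagrantfile (hypothetical contents).
printf 'config.vm.define "node2"\n' > Vagrantfile.demo
printf 'node2.vm.network :forwarded_port, guest: 60010, host: 60012\n' >> Vagrantfile.demo

# Bump every occurrence of node2 to node3, in place.
sed -i 's/node2/node3/g' Vagrantfile.demo

grep node3 Vagrantfile.demo
```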
d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 of the "Getting Started" section.
e. Inside the VM, install emacs, maven and firefox:
   sudo apt-get install emacs

   sudo apt update
   sudo apt install maven

   sudo apt-get -y install firefox
f. We set up nutch 2.3.1, which can be downloaded from, as that is the version that worked, per the nutch2 tutorial instructions, with the specific versions of hadoop, hbase and gora configured for the vagrant VM described here.
After untarring the nutch 2.3.1 source tarball:
  1. Move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
  2. Download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from
     and put them into the apache-nutch-2.3.1/conf folder.
  3. Then continue following the nutch tutorial 2 instructions at to set up nutch2 (and apache-solr if needed, but I didn't install apache-solr for nutch v2).
     - nutch-site.xml has already been configured with as much optimisation and crawl speed-up as we know about for nutch
     - for each site to be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, and that site's regex filter lines will be appended to its end.
J. Automated crawling with Nutch v2.3.1 and post-processing

1. When you're ready to start crawling with Nutch 2.3.1:
- copy the file (from into the vagrant machine at top level, and make the script executable.
- copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by running of the common-crawl downloaded data where MRI was the primary language) and put it into the vagrant machine at top level.
- run on a site or range of sites not yet crawled, e.g.
    ./ 00485-00500
2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled site's folder will contain a dump.txt with the text output of that site's web pages. The "crawled" folder, with site subfolders each containing a dump.txt file, can be processed with
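The 5-digit site numbering means a range argument like 00485-00500 expands to zero-padded subfolder names under crawled/. A sketch of iterating over such a range (the crawl and post-processing script names themselves are elided in this README):

```shell
# Split a range argument like "00485-00500" into its endpoints.
RANGE=00485-00500
START=${RANGE%-*}
END=${RANGE#*-}

# Emit the zero-padded subfolder names matching the crawled/ layout,
# e.g. crawled/00485 holding that site's dump.txt after crawling.
for n in $(seq "$START" "$END"); do
  printf 'crawled/%05d\n' "$n"
done
```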