Changeset 33598 for gs3-extensions


Timestamp: 2019-10-22T20:19:54+13:00
Author: ak19
Message:

More instructions on setting up Nutch now that I've remembered to commit the prepared conf files. I've also added the instructions into the top-level GS_README here, since it was a pain untarring the vagrant-for-nutch2 tarball each time just to read the instructions I included in it.

Location: gs3-extensions/maori-lang-detection/hdfs-cc-work
Files: 2 edited

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT

    r33545    r33598
   16   16  H. Austici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
   17   17  I. Setting up Nutch v2 on its own Vagrant VM
        18  J. Automated crawling with Nutch v2.3.1 and post-processing
   18   19  
   19   20  ----------------------------------------
     
  601  602  ----------------------------------------------------
  602  603  1. Untar vagrant-for-nutch2.tar.gz
  603       2. Follow the instructions in vagrant-for-nutch2/GS_README.txt
       604  2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt
  604  605  
  605  606  ---
     
  613  614  (Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the Apache Nutch 2 installation guide.)
  614  615  
       616  ---
       617  Vagrant VM for Nutch2
       618  ---
       619  This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark
       620  
       621  However:
       622  - It comes with the older versions of Hadoop 2.5.2 and HBase 0.98.21, and no Spark, Hive or other packages.
       623  - The VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101).
       624  - Since not all packages are installed, fewer ports needed forwarding. They are forwarded to portnumber+2 so as not to conflict with any vagrant VM that still uses the original vagrant image's forwarded port numbers.
       625  - scripts/common.sh uses an HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21).
       626  - HBase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link created there needed to refer to a path of this form.
       627  
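The last two points above amount to threading the -hadoop2 suffix through the archive name and install path. A minimal sketch in shell; the variable names and echo lines are illustrative assumptions, not verbatim contents of the modified scripts:

```shell
#!/bin/sh
# Hypothetical sketch of the -hadoop2 adjustments described above; the real
# scripts/common.sh and scripts/setup-hbase.sh may name things differently.
HBASE_VERSION=0.98.21

# common.sh: the archive name carries the -hadoop2 suffix specific to v0.98.21
HBASE_ARCHIVE="hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz"

# setup-hbase.sh: the untarred install dir also carries the suffix, so the
# symlink (e.g. /usr/local/hbase) must point at the -hadoop2 path
HBASE_HOME="/usr/local/hbase-${HBASE_VERSION}-hadoop2"

echo "$HBASE_ARCHIVE"
echo "$HBASE_HOME"
```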
       628  INSTRUCTIONS:
       629  a. Mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
       630  b. But after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with their modified counterparts included in this zip file.
       631  c. Wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, instead use "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile.
       632  If there's already a vagrant VM set up as node2 or with IP "10.211.55.102", then adjust all the files in the git-cloned vagrant VM already modified by the contents of this vagrant-for-nutch2 folder as follows:
       633  - increment all occurrences of node2 and IP "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
       634  - in the Vagrantfile, increment the forwarded ports by another 2 or so from the highest port numbers already in use by other vagrant VMs.
       635  d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 of the "Getting Started" section.
       636  e. Inside the VM, install emacs, maven, firefox:
       637  
       638     sudo apt-get install emacs
       639  
       640     sudo apt update
       641     sudo apt install maven
       642  
       643     sudo apt-get -y install firefox
       644  
       645  f. We set up Nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/. That version worked, following the Nutch 2 tutorial instructions, with the specific versions of Hadoop, HBase and Gora configured for the vagrant VM described here.
       646  
       647  After untarring the Nutch 2.3.1 source tarball:
       648    1. Move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
       649    2. Download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf
       650       and put them into the apache-nutch-2.3.1/conf folder.
       651    3. Then continue following the Nutch 2 tutorial instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up Nutch 2 (and Apache Solr if needed, but I didn't install Apache Solr for Nutch v2).
       652       - nutch-site.xml has already been configured with as much optimisation and speeding up of the crawling as we know about for Nutch.
       653       - For each site to be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, with that site's regex filter lines appended to its end.
       654  
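Steps 1 and 2 of the conf setup can be sketched in shell. The "?format=txt" raw-download form of the trac URLs is an assumption on my part (the text gives the /browser viewer URL); downloading the two files by hand works just as well:

```shell
#!/bin/sh
# Sketch of conf steps 1 and 2 above; run from where the tarball was untarred.
cd apache-nutch-2.3.1/conf

# 1. keep the stock filter file around as a backup
mv regex-urlfilter.txt regex-urlfilter.txt.orig

# 2. fetch the prepared conf files into this folder
#    (?format=txt raw-download URL form is assumed, not given in the text)
base="http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf"
wget -O nutch-site.xml "${base}/nutch-site.xml?format=txt"
wget -O regex-urlfilter.GS_TEMPLATE "${base}/regex-urlfilter.GS_TEMPLATE?format=txt"
```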
       655  ------------------------------------------------------------------------
       656  J. Automated crawling with Nutch v2.3.1 and post-processing
       657  ------------------------------------------------------------------------
       658  1. When you're ready to start crawling with Nutch 2.3.1:
       659  - Copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level. Make the script executable.
       660  - Copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by running CCWETProcessor.java over the common-crawl downloaded data where MRI was the primary language) into the vagrant machine at top level.
       661  - Run batchcrawl.sh on a site or range of sites not yet crawled, e.g.
       662      ./batchcrawl.sh 00485-00500
       663  
       664  2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled-site folder will contain a dump.txt with the text output of that site's web pages. The "crawled" folder, with site subfolders each containing a dump.txt file, can then be processed with NutchTextDumpProcessor.java.
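As a quick sanity check before handing "crawled" over to NutchTextDumpProcessor.java, a small shell loop (my own addition, not part of the committed scripts; folder layout as described above) can report any site folder that is missing its dump.txt:

```shell
#!/bin/sh
# Report any crawled-site folder that lacks the expected dump.txt.
# Assumes the layout described above: crawled/00485/ ... crawled/00500/
for site in crawled/*/; do
    if [ -f "${site}dump.txt" ]; then
        echo "OK: ${site}"
    else
        echo "MISSING: ${site}dump.txt"
    fi
done
```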
       665  
  615  666  
  616  667  