Changeset 33598 for gs3-extensions
- Timestamp: 2019-10-22T20:19:54+13:00
- Location: gs3-extensions/maori-lang-detection/hdfs-cc-work
- Files: 2 edited
gs3-extensions/maori-lang-detection/hdfs-cc-work/GS_README.TXT
Diff r33545 → r33598 (new text shown; unchanged regions elided as [...]):

H. Austici's crawl - CLI to download web sites as WARCs, features basics to avoid crawler traps
I. Setting up Nutch v2 on its own Vagrant VM machine
J. Automated crawling with Nutch v2.3.1 and post-processing

----------------------------------------
[...]
----------------------------------------------------
1. Untar vagrant-for-nutch2.tar.gz
2. Follow the instructions below. A copy is also in vagrant-for-nutch2/GS_README.txt

---
[...]
(Another option was MongoDB instead of HBase, https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/, but that was not even covered in the apache nutch 2 installation guide.)

---
Vagrant VM for Nutch2
---
This vagrant virtual machine is based on https://github.com/martinprobson/vagrant-hadoop-hive-spark

However:
- It comes with the older versions of hadoop 2.5.2 and hbase 0.98.21, and no spark, hive or other packages.
- The VM is called node2 with IP 10.211.55.102 (instead of node1 with IP 10.211.55.101).
- Since not all packages are installed, fewer ports needed forwarding. They are forwarded to port number + 2 so as not to conflict with any vagrant VM that used the original vagrant image's forwarded port numbers.
- scripts/common.sh uses HBASE_ARCHIVE of the form "hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz" (the -hadoop2 suffix is specific to v0.98.21).
- hbase gets installed as /usr/local/hbase-$HBASE_VERSION-hadoop2 (again with the additional -hadoop2 suffix specific to v0.98.21) in scripts/setup-hbase.sh, so the symbolic link creation there needed to refer to a path of this form.

INSTRUCTIONS:
a. Mostly follow the "Getting Started" instructions at https://github.com/martinprobson/vagrant-hadoop-hive-spark
b.
but after step 3, replace the github-cloned Vagrantfile, scripts and resources folders with their modified counterparts included in this zip file.
c. Wherever the rest of that git page refers to "node1", IP "10.211.55.101" and specific port numbers, use instead "node2", IP "10.211.55.102" and the forwarded port numbers in the customised Vagrantfile.
   If there's already a node2 / IP "10.211.55.102" vagrant VM set up, then adjust all the files in the git-cloned vagrant VM already modified by the contents of this vagrant-for-nutch2 folder as follows:
   - increment all occurrences of node2 and "10.211.55.102" to node3 and IP "10.211.55.103", if not already taken, and
   - in the Vagrantfile, increment the forwarded ports by another 2 or so from the highest port number values already in use by other vagrant VMs.
d. After doing "vagrant up --provider=virtualbox" to create the VM, do "vagrant ssh" or "vagrant ssh node<#>" (e.g. "vagrant ssh node2") for step 8 in the "Getting Started" section.
e. Inside the VM, install emacs, maven and firefox:

      sudo apt-get install emacs

      sudo apt update
      sudo apt install maven

      sudo apt-get -y install firefox

f. We set up nutch 2.3.1, which can be downloaded from https://archive.apache.org/dist/nutch/2.3.1/, as that version worked as per the nutch2 tutorial instructions with the configuration of the specific versions of hadoop, hbase and gora for the vagrant VM described here.

   After untarring the nutch 2.3.1 source tarball:
   1. Move apache-nutch-2.3.1/conf/regex-urlfilter.txt to apache-nutch-2.3.1/conf/regex-urlfilter.txt.orig
   2. Download the two files nutch-site.xml and regex-urlfilter.GS_TEMPLATE from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/conf and put them into the apache-nutch-2.3.1/conf folder.
   3.
Then continue following the nutch tutorial 2 instructions at https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial to set up nutch2 (and apache-solr, if needed, but I didn't install apache-solr for nutch v2).
   - nutch-site.xml has already been configured to do as much optimisation and speeding up of the crawling as we know about concerning nutch.
   - For each site to be crawled, regex-urlfilter.GS_TEMPLATE will get copied as the live "regex-urlfilter.txt" file, and lines of regex filters will be appended to its end.

------------------------------------------------------------------------
J. Automated crawling with Nutch v2.3.1 and post-processing
------------------------------------------------------------------------
1. When you're ready to start crawling with Nutch 2.3.1:
   - Copy the batchcrawl.sh file (from http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts) into the vagrant machine at top level. Make the script executable.
   - Copy the to_crawl.tar.gz file containing the "to_crawl" folder (generated by CCWETProcessor.java run over the common-crawl downloaded data where MRI was the primary language) into the vagrant machine at top level.
   - Run batchcrawl.sh on a site or range of sites not yet crawled, e.g.

        ./batchcrawl.sh 00485-00500

2. When crawling is done, the above will have generated the "crawled" folder containing a subfolder for each of the crawled sites, e.g. subfolders 00485 to 00500. Each crawled site folder will contain a dump.txt with the text output of the site's web pages. The "crawled" folder with its site subfolders, each containing a dump.txt file, can then be processed with NutchTextDumpProcessor.java.
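The -hadoop2 naming convention noted in the Vagrant VM section earlier can be sketched as follows. This is an illustrative reconstruction, not the actual contents of scripts/common.sh or scripts/setup-hbase.sh; only HBASE_VERSION, the HBASE_ARCHIVE pattern and the /usr/local install path come from the text, and the HBASE_PREFIX name is made up here.

```shell
# Illustrative sketch (assumed, not the real scripts): how the v0.98.21
# archive name and install prefix both carry the -hadoop2 suffix.
HBASE_VERSION=0.98.21

# Archive name as used by scripts/common.sh per the notes above:
HBASE_ARCHIVE="hbase-${HBASE_VERSION}-hadoop2-bin.tar.gz"

# Install prefix per scripts/setup-hbase.sh; the symbolic link created
# there must target a path of this form:
HBASE_PREFIX="/usr/local/hbase-${HBASE_VERSION}-hadoop2"

echo "$HBASE_ARCHIVE"
echo "$HBASE_PREFIX"
```

The point is simply that both the tarball and the unpacked directory carry the version-specific -hadoop2 suffix, so any path built from $HBASE_VERSION alone will not match.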
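The per-site filter setup described above (the template copied as the live regex-urlfilter.txt, with site-specific regex lines appended) can be sketched like this. It runs in a scratch directory; the template contents and the example site regex are stand-ins, and only the two file names come from the text.

```shell
# Sketch of the per-site filter preparation; the scratch dir, template
# contents and site regex are illustrative stand-ins.
conf=$(mktemp -d)

# Stand-in template (the real regex-urlfilter.GS_TEMPLATE comes from the
# Greenstone conf folder linked above):
printf '%s\n' '# shared filter rules' '-.*' > "$conf/regex-urlfilter.GS_TEMPLATE"

# The template becomes the live filter file for the site being crawled...
cp "$conf/regex-urlfilter.GS_TEMPLATE" "$conf/regex-urlfilter.txt"

# ...and the site-specific regex filter lines are appended to its end:
printf '%s\n' '+^https?://([a-z0-9-]+\.)*example\.org/' >> "$conf/regex-urlfilter.txt"

cat "$conf/regex-urlfilter.txt"
```

Appending after the copy keeps the shared rules first, so Nutch evaluates the generic filters before the per-site allow rule.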
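A range argument such as "00485-00500" might be expanded into the zero-padded site folder names roughly as below. This is a guess at the idea only, not batchcrawl.sh's actual parsing.

```shell
# Hypothetical expansion of a "start-end" site range into zero-padded
# site IDs; batchcrawl.sh's real logic may differ.
range="00485-00500"
start=${range%-*}   # text before the dash -> "00485"
end=${range#*-}     # text after the dash  -> "00500"

# GNU seq reads the zero-padded bounds as decimal, and %05g re-pads
# each value back to five digits:
for site in $(seq -f '%05g' "$start" "$end"); do
  printf '%s\n' "$site"
done
```

Keeping the IDs zero-padded matters because the to_crawl and crawled subfolders are named with the padded form (00485, not 485).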