Timestamp: 10/25/16 11:27:10
Author: davidb
Message: Tidy up of setup description
Location: other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk
Files: 1 added, 1 edited

  • other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt  (r30913 → r30914)

----
Introduction
----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Feature JSON files in parallel, suitable for ingesting into Solr.

To aid parallelism, the code is designed to read the JSON files from
HDFS, so provisioning of the cluster includes Hadoop core in addition
to Spark.

Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java, and
Hadoop.  Spark is then added on top of that:

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
  https://github.com/calo81/vagrant-hadoop-cluster


*Assumptions*

  * You have VirtualBox and Vagrant installed
    (at time of writing, VirtualBox v5.0.28 and Vagrant 1.8.6)


*Useful*

  * Installing the Vagrant VirtualBox Guest Additions plugin stops warnings
    about potentially incompatible versions:

      vagrant plugin install vagrant-vbguest


----
Setup Procedure
----
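The assumptions above can be sanity-checked from the command line before the first 'vagrant up'.  This is only a sketch: 'VBoxManage' is the CLI that ships with VirtualBox, and the version numbers quoted above are simply the ones current when this README was written.

```shell
# Sanity-check the assumed prerequisites before the first 'vagrant up'.
# VBoxManage is VirtualBox's bundled CLI; versions newer than those
# quoted above (VirtualBox v5.0.28, Vagrant 1.8.6) will likely work too.
missing=0
for tool in VBoxManage vagrant; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool ($("$tool" --version 2>/dev/null | head -1))"
  else
    echo "missing: $tool"
    missing=1
  fi
done
```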

This is a 2 step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code that processes the JSON files

Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files.  The
files checked out include the README file covering Step 2.

----
Step 1
----

From within the directory this README.txt is located in, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to
do.  Subsequent use of this command spins the cluster up much faster.

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS.  You do this as the user
'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format
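Since re-running 'hdfs namenode -format' wipes the existing HDFS contents, a guard like the following sketch can help.  The NAME_DIR path here is an assumption, not this cluster's configured value -- check dfs.namenode.name.dir in your hdfs-site.xml before relying on it.

```shell
# Only format HDFS if no formatted namenode directory exists yet.
# NAME_DIR is a guess at dfs.namenode.name.dir -- check hdfs-site.xml
# for the real value before relying on this.
NAME_DIR="${NAME_DIR:-/tmp/hadoop-htrc/dfs/name}"

needs_format() {
  # A formatted namenode directory contains a 'current' subdirectory.
  [ ! -d "$NAME_DIR/current" ]
}

if needs_format; then
  echo "no formatted namenode dir at $NAME_DIR"
  if command -v hdfs >/dev/null 2>&1; then
    hdfs namenode -format
  fi
else
  echo "HDFS already formatted; skipping"
fi
```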
     
  http://10.10.0.52:8080/

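That address is presumably the Spark master's web UI (8080 is the standalone master's default web UI port).  A small poll like this sketch confirms it is answering once the cluster is up:

```shell
# Check whether the Spark master's web UI answers.  The address is the
# one given above for this cluster; 8080 is the standalone master's
# default web UI port.
MASTER_UI="${MASTER_UI:-http://10.10.0.52:8080/}"

ui_up() {
  curl -s -o /dev/null --max-time 2 "$MASTER_UI"
}

if ui_up; then
  echo "Spark master UI reachable at $MASTER_UI"
else
  echo "Spark master UI not (yet) reachable at $MASTER_UI"
fi
```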
----
Getting ready for Step 2
----

With the Spark cluster with HDFS up and running, you are now ready to
proceed to Step 2, running the JSON processing code.

There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user.  Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

Install subversion and maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

  sudo su - htrc

  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt
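Before moving on to Step 2, a quick check that the working copy landed where expected can save confusion; a sketch, using the repository URL from the checkout command above:

```shell
# Verify the svn checkout landed where expected before starting Step 2.
# REPO is the URL from the checkout command above; CHECKOUT_DIR is the
# working-copy directory that command creates.
REPO="http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk"
CHECKOUT_DIR="solr-extracted-features"

if [ -d "$CHECKOUT_DIR/.svn" ]; then
  echo "working copy present: $CHECKOUT_DIR"
elif command -v svn >/dev/null 2>&1; then
  echo "no working copy yet; run: svn co $REPO $CHECKOUT_DIR"
else
  echo "svn not installed -- run the apt-get step above first"
fi
```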
----

Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but it looks like a fairly old version of the software (currently unused).

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers) use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
SecondaryNode
----

How to start the secondary namenode on another node:

  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

In hdfs-site.xml:

  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
  </property>

----
Spark Cluster
----

  http://spark.apache.org/docs/latest/spark-standalone.html
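For a cluster like this one, the standalone docs above boil down to listing the worker hostnames in conf/slaves and running the start scripts.  A sketch, where SPARK_HOME and the hostnames (slave1..slave3, per the "master + 3 slaves" description at the top of this file) are assumptions:

```shell
# Sketch of bringing up standalone Spark by hand, following the
# spark-standalone docs linked above.  SPARK_HOME and the slave
# hostnames are assumptions for this cluster ('master' + 3 slaves).
SPARK_HOME="${SPARK_HOME:-/usr/local/spark}"
SLAVES_FILE="${SLAVES_FILE:-/tmp/slaves.example}"

# One worker hostname per line, as conf/slaves expects.
cat > "$SLAVES_FILE" <<'EOF'
slave1
slave2
slave3
EOF

echo "copy $SLAVES_FILE to $SPARK_HOME/conf/slaves on the master, then:"
echo "  $SPARK_HOME/sbin/start-all.sh    # starts the master and all slaves"
```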