Changeset 30914 for other-projects

Timestamp: 25.10.2016 11:27:10
Author: davidb
Message: Tidy up of setup description

Location: other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk
Files: 1 added, 1 modified

Legend: lines beginning with '-' were removed in r30914, lines beginning with '+' were added, and unprefixed lines are unchanged context.
  • other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt (r30913 → r30914)

+
+----
+Introduction
+----
 
 Vagrant provisioning files to spin up a modest Spark cluster (master
-+ 3 slaves + backup) for experiments processing HTRC Extracted Feature
-JSON files suitable for ingesting into Solr.
-
-To aid parallelism, code is designed to read JSON files from HDFS, so
-the provision of the cluster includes Hadoop core in addition to Spark
++ 3 slaves + backup) for experiments in processing HTRC Extracted
+Feature JSON files in parallel, suitable for ingesting into Solr.
 
 
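[Note: HTRC is the HathiTrust Research Center; its Extracted Features dataset is distributed as one JSON file of page-level feature counts per volume, which is what makes the per-file parallel processing described above a natural fit.]
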
-Provisioning uses Puppet scripting, based on the following on-line
-resources, but updated to use newer versions of Ubuntu, Java,
-and Hadoop.  Spark is then added in on top of that.
+*Assumptions*
+
+  * You have VirtualBox and Vagrant installed
+    (at the time of writing, VirtualBox v5.0.28 and Vagrant 1.8.6)
+
+
+*Useful*
+
+  * Installing the Vagrant VirtualBox Guest Additions plugin stops warnings
+    about potentially incompatible versions:
+
+      vagrant plugin install vagrant-vbguest
 
 
-  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
+----
+Setup Procedure
+----
 
-  https://github.com/calo81/vagrant-hadoop-cluster
+This is a two-step process:
 
-To get everything setup, type:
+  Step 1: Setting up the cluster
+  Step 2: Checking out the Java code that processes the JSON files
 
-  vargrant up
 
-Then log in to the master node, and swithc to 'ubuntu' user
+Step 1 is covered by this README file, ending with an svn checkout of
+the Java code on the 'master' node that processes the JSON files.  The
+files checked out include the README file covering Step 2.
+
+----
+Step 1
+----
 
-  vargrant ssh master
-  sudo su - ubuntu
+From within the directory where this README.txt is located, enter:
+
+  vagrant up
+
+The first time this is run, there is a lot of downloading and setup to
+do.  Subsequent runs of this command spin the cluster up much faster.
+
+Once the cluster is set up, you need to get the Spark framework up and
+running, which in turn uses Hadoop's HDFS.  You do this as the user
+'htrc' on the 'master' node:
+
+  vagrant ssh master
+  sudo su - htrc
 
 The first time around, you need to format an HDFS area to use:
 
   hdfs namenode -format
 
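[Note: after formatting, a quick sanity check that HDFS is up can be done with the standard Hadoop tools. A hedged example, assuming the 'htrc' user has the Hadoop binaries on its PATH and the HDFS daemons have been started:

  hdfs dfsadmin -report   # lists live DataNodes and their capacity
  hdfs dfs -ls /          # shows the (initially empty) HDFS root
]
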
[... unchanged lines not shown in the changeset view ...]

   http://10.10.0.52:8080/
 
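[Note: port 8080 is the default web UI port of a Spark standalone master, so this URL, with 10.10.0.52 presumably being the 'master' VM's private-network address, should show the master's status page, including registered workers, once Spark is running.]
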
+----
+Getting ready for Step 2
+----
+
+With the Spark cluster and HDFS up and running, you are now ready to
+proceed to Step 2, running the JSON processing code.
 
 
-Supporting Resources
-====================
+There are a couple of packages the 'master' node needs for this ('svn'
+and 'mvn'), which we install as the 'vagrant' user.  Then we are in a
+position to check out the Java code, which in turn includes the README
+file for Step 2.
+
+Install Subversion and Maven using the 'vagrant' user's sudo ability:
+
+  vagrant ssh master
+  sudo apt-get install subversion
+  sudo apt-get install maven
+
+Now switch from the 'vagrant' user to 'htrc' and check out the Java code:
+
+  sudo su - htrc
+
+  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features
+
+Now follow the README file for Step 2:
+
+  cd solr-extracted-features
+  less README.txt
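
[Note: the README checked out above is authoritative for Step 2, but with Maven installed the usual first build command for a project laid out like this would be:

  mvn package   # compile and package the checked-out Java code

This is an assumption based on standard Maven practice, not something this changeset confirms.]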
 
 ----
-Basic Hadoop Cluster
-----
 
-Useful documentation about setting up a Hadoop cluster, read:
 
-  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
-then
-  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html
 
-OR
 
-  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
-then
-  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html
 
-For working with newer Linux OS and version of software:
-
-  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
-
-----
-Hadoop + Apache Ambari in 3 lines:
-----
-
-  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/
-
-but looks like a fairly old version of software (currently unused).
-
-----
-Vagrant
-----
-
-To get rid of 'Guest Additions' warnins (about potentially
-incompatible version numbers) use 'vbguest' plugin:
-
-  vagrant plugin install vagrant-vbguest
-
-For more details see:
-
-  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/
-
-----
-SecondaryNode
-----
-
-  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node
-
-<property>
-  <name>dfs.namenode.secondary.http-address</name>
-  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
-</property>
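
[Note: dfs.namenode.secondary.http-address is a standard HDFS setting, configured in hdfs-site.xml, that pins the SecondaryNameNode's HTTP endpoint to a chosen host. The EC2 hostname in the removed snippet presumably comes from the Stack Overflow answer linked above rather than from this cluster.]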
-
-----
-Spark Cluster
-----
-
-  http://spark.apache.org/docs/latest/spark-standalone.html
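
[Note: that page documents Spark's standalone deploy mode, which is what this cluster runs. As a hedged sketch of the commands it describes (paths assume a standard Spark distribution; in Spark releases of this era the per-worker script was named start-slave.sh, and the master hostname here is an assumption):

  $SPARK_HOME/sbin/start-master.sh                      # on the master node
  $SPARK_HOME/sbin/start-slave.sh spark://master:7077   # on each worker

or, with the worker hosts listed in conf/slaves:

  $SPARK_HOME/sbin/start-all.sh
]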