root/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/NOTES-AND-SOURCES.txt @ 30914

Revision 30914, 4.0 KB (checked in by davidb, 4 years ago)

Tidy up of setup description

In learning about running Spark and Hadoop on a cluster, the following resources were found to be useful.

----
Setting up a Cluster using Vagrant and Puppet
----

Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop.  Spark is then added in on top of that.

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

  https://github.com/calo81/vagrant-hadoop-cluster

----
Basic Hadoop Cluster setup manually
----

For useful documentation about setting up a Hadoop cluster, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines:
----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers) use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/
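
If the install command is going into a setup script, it can be guarded
so that it only runs when the plugin is missing.  A minimal sketch: the
function name and the VAGRANT override variable are illustrative and not
part of Vagrant, and the match assumes 'vagrant plugin list' prints one
plugin name at the start of each line.

```shell
# VAGRANT is a hypothetical override hook so the logic can be dry-run.
VAGRANT="${VAGRANT:-vagrant}"

# Install vagrant-vbguest only when 'vagrant plugin list' does not show it.
ensure_vbguest() {
    if $VAGRANT plugin list | grep -q '^vagrant-vbguest'; then
        echo "vagrant-vbguest already installed"
    else
        $VAGRANT plugin install vagrant-vbguest
    fi
}
```

Once defined, run it with:

  ensure_vbguest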

----
SecondaryNode
----

http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

To run the secondary NameNode on another node, set its address in
hdfs-site.xml, for example:

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
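
To confirm which value Hadoop has actually picked up, the 'hdfs getconf'
tool can be queried on the node in question.  A minimal sketch: the
function name and the HDFS override variable are illustrative and not
part of Hadoop.

```shell
# HDFS is a hypothetical override hook so the logic can be dry-run.
HDFS="${HDFS:-hdfs}"

# Report the secondary NameNode host(s) and the configured HTTP address.
show_secondary_namenode() {
    echo "secondary namenode host(s): $($HDFS getconf -secondaryNameNodes)"
    echo "http address: $($HDFS getconf -confKey dfs.namenode.secondary.http-address)"
}
```

Once defined, run it as the 'htrc' user with:

  show_secondary_namenode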

----
Spark Cluster
----

http://spark.apache.org/docs/latest/spark-standalone.html


----
Introduction
----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Feature JSON files in parallel, suitable for ingesting into Solr.


*Assumptions*: You have VirtualBox and Vagrant installed.


This is a two-step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code to process the JSON files


Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files.  The
files checked out include the README file covering Step 2.

----
Step 1
----

From within the directory where this README.txt is located, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to
do.  Subsequent use of this command spins the cluster up much faster.
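
To confirm that every node came up, the output of 'vagrant status' can
be filtered for machines that are not running.  A minimal sketch,
assuming the machine-readable output uses the comma-separated layout
timestamp,target,type,data; the function name and the VAGRANT override
variable are illustrative and not part of Vagrant.

```shell
# VAGRANT is a hypothetical override hook so the logic can be dry-run.
VAGRANT="${VAGRANT:-vagrant}"

# Print the name of every machine whose reported state is not "running".
# Assumed line layout: timestamp,target,type,data
list_not_running() {
    $VAGRANT status --machine-readable |
        awk -F, '$3 == "state" && $4 != "running" { print $2 }'
}
```

Once defined, run it from the same directory as 'vagrant up'; no output
means that all machines report as running:

  list_not_running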

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS.  You do this as the user
'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Then, on the first and every subsequent occasion, start up the HDFS and
Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh
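
The format and start-up steps above can be folded into one helper that
formats only when needed.  A minimal sketch, assuming the caller passes
in the NameNode metadata directory (whatever dfs.namenode.name.dir
points to on this cluster, which these notes do not record) and that a
formatted directory contains the file current/VERSION.  The function
name and the HDFS, START_DFS and START_SPARK override variables are
illustrative and not part of Hadoop or Spark.

```shell
# Hypothetical override hooks so the logic can be dry-run.
HDFS="${HDFS:-hdfs}"
START_DFS="${START_DFS:-start-dfs.sh}"
START_SPARK="${START_SPARK:-spark-start-all.sh}"

# $1: NameNode metadata directory (assumption: supplied by the caller).
start_cluster_services() {
    name_dir="$1"
    # A formatted NameNode directory is assumed to hold current/VERSION;
    # format only when that file is absent.
    if [ ! -f "$name_dir/current/VERSION" ]; then
        $HDFS namenode -format || return 1
    fi
    # Start HDFS first, then Spark, stopping if HDFS fails to start.
    $START_DFS && $START_SPARK
}
```

Once defined, run it as the 'htrc' user, substituting the real
directory for the placeholder:

  start_cluster_services /path/to/namenode/dir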

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/
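
When no browser is to hand, the same page can be probed from the
command line.  A minimal sketch using curl; the function name and the
CURL override variable are illustrative, and the 5 second timeout is an
arbitrary choice.

```shell
# CURL is a hypothetical override hook so the logic can be dry-run.
CURL="${CURL:-curl}"

# Succeeds (exit status 0) when the monitoring page answers without an
# HTTP error; $1 optionally overrides the URL given in these notes.
spark_master_ui_up() {
    url="${1:-http://10.10.0.52:8080/}"
    $CURL --silent --fail --max-time 5 --output /dev/null "$url"
}
```

Once defined, run it with:

  spark_master_ui_up && echo "Spark master UI is responding"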

----
Getting ready for Step 2
----

With the Spark cluster and HDFS up and running, you are now ready to
proceed to Step 2, running the JSON processing code.


There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user.  Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

Install Subversion and Maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven
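
For unattended use, the two installs can be combined and run without
prompting.  A minimal sketch: refreshing the package index first and
passing -y are additions to the commands above, and the function name
and SUDO override variable are illustrative.

```shell
# SUDO is a hypothetical override hook so the logic can be dry-run.
SUDO="${SUDO:-sudo}"

# Refresh the package index, then install both packages without prompting.
install_build_tools() {
    $SUDO apt-get update &&
        $SUDO apt-get install -y subversion maven
}
```

Once defined, run it as the 'vagrant' user on the 'master' node with:

  install_build_tools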

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

  sudo su - htrc

  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features
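
If the checkout may be repeated (for example after re-provisioning),
it can be made safe to re-run by updating an existing working copy
rather than checking out again.  A minimal sketch, assuming a working
copy is recognisable by a .svn directory at its top level; the function
name and the SVN override variable are illustrative.

```shell
# SVN is a hypothetical override hook so the logic can be dry-run.
SVN="${SVN:-svn}"
REPO_URL="http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk"

# $1: destination directory (defaults to the name used in these notes).
checkout_or_update() {
    dest="${1:-solr-extracted-features}"
    if [ -d "$dest/.svn" ]; then
        # Existing working copy: bring it up to date.
        $SVN update "$dest"
    else
        # No working copy yet: do a fresh checkout.
        $SVN co "$REPO_URL" "$dest"
    fi
}
```

Once defined, run it as the 'htrc' user with:

  checkout_or_update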

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt

----

----