source: other-projects/hathitrust/wcsa/extracted-features-solr/trunk/vagrant-spark-hdfs-cluster/NOTES-AND-SOURCES.txt

Last change on this file was 30914, checked in by davidb, 7 years ago

Tidy up of setup description

In learning about running Spark and Hadoop on a cluster, the following resources were found to be useful.

----
Setting up a Cluster using Vagrant and Puppet
----

Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop. Spark is then added in on top of that.

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

  https://github.com/calo81/vagrant-hadoop-cluster

----
Basic Hadoop Cluster setup manually
----

For useful documentation about setting up a Hadoop cluster manually, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

This looks like it relies on a fairly old version of the software, and is
currently unused.

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers) use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/
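
As a rough sketch, the install can be made safe to re-run by checking the
plugin list first. The helper names has_plugin and ensure_vbguest are made
up for illustration, and the check assumes 'vagrant plugin list' prints one
plugin per line in the form 'name (version, ...)'.

```shell
# Sketch only: has_plugin and ensure_vbguest are hypothetical helper names,
# not part of Vagrant.

# has_plugin NAME: reads `vagrant plugin list` output on stdin and succeeds
# if a line starts with NAME followed by a space.
has_plugin() {
    grep -q "^$1 "
}

# Install vagrant-vbguest only when it is not already listed.
ensure_vbguest() {
    if vagrant plugin list | has_plugin vagrant-vbguest; then
        echo "vagrant-vbguest already installed"
    else
        vagrant plugin install vagrant-vbguest
    fi
}
```

Calling ensure_vbguest on the host then installs the plugin only when it is
missing.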

----
Secondary NameNode
----

To start the Secondary NameNode on a node other than the NameNode, see:

  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

The relevant property, with an example value:

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
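
This property normally lives in hdfs-site.xml. As a quick sanity check,
something like the following can pull the configured value back out of the
file. get_property is a hypothetical helper, and it assumes the <value>
element sits on the same line as, or on a line after, the matching <name>;
it is plain text matching, not a real XML parser.

```shell
# Sketch only: get_property is a hypothetical helper, not a Hadoop tool.

# get_property FILE NAME: print the <value> text belonging to the property
# whose <name> is NAME. Prints nothing if NAME is not found.
get_property() {
    awk -v key="$2" '
        index($0, "<name>" key "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")      # drop everything up to the value
            sub(/<\/value>.*/, "")    # drop the closing tag and the rest
            print
            exit
        }
    ' "$1"
}
```

For example: get_property "$HADOOP_CONF_DIR/hdfs-site.xml" dfs.namenode.secondary.http-address
(assuming HADOOP_CONF_DIR points at the Hadoop configuration directory). On
a working installation, 'hdfs getconf -confKey dfs.namenode.secondary.http-address'
reports the value Hadoop itself sees.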

----
Spark Cluster
----

  http://spark.apache.org/docs/latest/spark-standalone.html

----
Introduction
----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Feature JSON files in parallel, suitable for ingesting into Solr.

*Assumptions*: You have VirtualBox and Vagrant installed.

This is a two-step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code to process the JSON files

Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files. The
files checked out include the README file covering Step 2.

----
Step 1
----

From within the directory where this README.txt is located, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to
do. Subsequent use of this command spins the cluster up much faster.

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS. You do this as the user
'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Then start up the HDFS and Spark daemon processes (on later runs, skip
the format step and go straight to this):

  start-dfs.sh
  spark-start-all.sh

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/
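
The difference between the first run and later runs can be sketched as a
small script that only formats when the NameNode directory has not been
formatted yet. This is a sketch under assumptions: is_formatted and
start_cluster are made-up names, the NameNode directory has to be passed in
(it is whatever dfs.namenode.name.dir is set to on this cluster, which these
notes do not record), and a formatted directory is recognised by its
current/VERSION file.

```shell
# Sketch only: is_formatted and start_cluster are hypothetical helper names.

# is_formatted DIR: succeeds if DIR looks like an already formatted
# NameNode directory (it contains current/VERSION).
is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# start_cluster DIR: format HDFS only on first use, then start the daemons.
# DIR is assumed to be the value of dfs.namenode.name.dir.
start_cluster() {
    if ! is_formatted "$1"; then
        hdfs namenode -format
    fi
    start-dfs.sh
    spark-start-all.sh
}
```

Run it as the 'htrc' user on 'master', passing the NameNode directory as the
only argument. Guarding the format step avoids re-formatting an HDFS area
that already holds data.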

----
Getting ready for Step 2
----

With the Spark cluster and HDFS up and running, you are now ready to
proceed to Step 2, running the JSON processing code.

There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user. Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

Install subversion and maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven
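
To confirm that both tools ended up on the PATH, a small check along these
lines can help. missing_tools is a hypothetical helper; it relies only on
the shell's built-in 'command -v'.

```shell
# Sketch only: missing_tools is a hypothetical helper name.

# missing_tools CMD...: print the name of each command not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running 'missing_tools svn mvn' prints nothing when both are installed, and
otherwise prints the name of each missing command.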

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

  sudo su - htrc

  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt
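
If the checkout step may be repeated (for instance after re-provisioning the
cluster), a sketch like the following updates an existing working copy
instead of checking out again. checkout_or_update is a made-up name, and it
treats a '.svn' directory inside the target as the sign of an existing
working copy.

```shell
# Sketch only: checkout_or_update is a hypothetical helper name.

# checkout_or_update URL DIR: run `svn update` if DIR is already a working
# copy, otherwise check URL out into DIR.
checkout_or_update() {
    if [ -d "$2/.svn" ]; then
        svn update "$2"
    else
        svn co "$1" "$2"
    fi
}
```

Here URL would be the svn.greenstone.org address given above and DIR would
be solr-extracted-features.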