
NOTES-AND-SOURCES.txt

Last change on this file was 30914, checked in by davidb, 7 years ago
Tidy up of setup description
File size: 4.0 KB

In learning about running Spark and Hadoop on a cluster, the following
resources were found to be useful.

----
Setting up a Cluster using Vagrant and Puppet
----

Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop. Spark is then added in on top of that.

http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

https://github.com/calo81/vagrant-hadoop-cluster

----
Basic Hadoop Cluster setup manually
----

For useful documentation about setting up a Hadoop cluster, read:

http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines:
----

https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

This approach appears to rely on a fairly old version of the software,
and is currently unused.

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers) use the 'vbguest' plugin:

vagrant plugin install vagrant-vbguest
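If you re-run the setup often, it can help to install the plugin only
when it is missing. The following is a minimal sketch, not part of the
original setup: 'ensure_vagrant_plugin' is a hypothetical helper name,
and it assumes each line printed by 'vagrant plugin list' starts with
the plugin name followed by a space.

```shell
#!/bin/sh
# Hypothetical helper (not part of Vagrant): install a plugin only if
# 'vagrant plugin list' does not already report it.
ensure_vagrant_plugin() {
    plugin="$1"
    # Assumption: each listed plugin appears as "<name> (<version>, ...)".
    if vagrant plugin list | grep -q "^${plugin} "; then
        echo "${plugin} already installed"
    else
        vagrant plugin install "${plugin}"
    fi
}

# Usage (run on the host machine, not inside a VM):
#   ensure_vagrant_plugin vagrant-vbguest
```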

For more details see:

http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
Secondary NameNode
----

http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

To run the secondary NameNode on another node, set its address in
hdfs-site.xml, for example:

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
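To confirm which secondary NameNode address HDFS has actually picked
up, you can ask 'hdfs getconf' for the property value. The sketch below
wraps that in two small functions; the function names are hypothetical,
and the expected host:port you pass in is whatever you configured.

```shell
#!/bin/sh
# Print the secondary NameNode HTTP address from the effective HDFS config.
secondary_nn_address() {
    hdfs getconf -confKey dfs.namenode.secondary.http-address
}

# Hypothetical helper: compare the configured address with an expected one.
# Returns non-zero on a mismatch.
check_secondary_nn() {
    expected="$1"
    actual="$(secondary_nn_address)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual"
    else
        echo "MISMATCH: expected $expected, got $actual"
        return 1
    fi
}

# Usage (as a user with the Hadoop tools on the PATH):
#   check_secondary_nn ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090
```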

----
Spark Cluster
----

http://spark.apache.org/docs/latest/spark-standalone.html

----
Introduction
----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Feature JSON files in parallel, suitable for ingesting into Solr.

Assumptions: You have VirtualBox and Vagrant installed.

This is a two-step process:

Step 1: Setting up the cluster
Step 2: Checking out the Java code to process the JSON files

Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files. The
files checked out include the README file covering Step 2.

----
Step 1
----

From within the directory where this README.txt is located, enter:

vagrant up

The first time this is run, there is a lot of downloading and setup to
do. Subsequent use of this command spins the cluster up much faster.

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS. You do this as the user
'htrc' on the 'master' node:

vagrant ssh master
sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

hdfs namenode -format
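Formatting should only happen once, because re-formatting discards the
existing HDFS metadata. As a rough safeguard, the sketch below skips
the format step when the NameNode storage directory already contains a
current/VERSION file, which a formatted NameNode directory has. The
helper name is hypothetical, and the directory you pass in is an
assumption: use the local path your cluster sets for
dfs.namenode.name.dir.

```shell
#!/bin/sh
# Hypothetical helper: format HDFS only if the given NameNode storage
# directory does not look formatted yet.
format_hdfs_once() {
    name_dir="$1"
    if [ -f "${name_dir}/current/VERSION" ]; then
        echo "NameNode directory ${name_dir} is already formatted; skipping"
    else
        hdfs namenode -format
    fi
}

# Usage (as 'htrc' on 'master'; the path below is a placeholder):
#   format_hdfs_once /path/to/namenode-dir
```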

Then (and on every later occasion) start up the HDFS and Spark daemon
processes:

start-dfs.sh
spark-start-all.sh
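One way to check that the daemons came up is to look at the output of
'jps' (shipped with the JDK), which prints one "<pid> <name>" line per
running Java process. The sketch below reports any expected names that
are missing. 'missing_daemons' is a hypothetical helper, and the names
to look for are assumptions about this cluster: typically NameNode and
the Spark standalone Master on 'master', DataNode and Worker on the
slaves.

```shell
#!/bin/sh
# Hypothetical helper: print each expected daemon name that does not
# appear in the output of 'jps'. Prints nothing if all are running.
missing_daemons() {
    running="$(jps)"
    for name in "$@"; do
        # Match the name as the whole second field, so that
        # "SecondaryNameNode" is not mistaken for "NameNode".
        if ! printf '%s\n' "$running" | grep -q " ${name}\$"; then
            echo "$name"
        fi
    done
}

# Usage (as 'htrc' on 'master'):
#   missing_daemons NameNode Master
```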

You can visit the Spark cluster monitoring page at:

http://10.10.0.52:8080/
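If you would rather check the monitoring page from the command line,
something like the following sketch can be used. It assumes 'curl' is
installed on the machine you run it from, and treats an HTTP 200
response as "up"; 'spark_ui_is_up' is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given URL answers with HTTP 200.
spark_ui_is_up() {
    # -s: silent, -o /dev/null: discard the page body,
    # --max-time 5: give up after 5 seconds,
    # -w '%{http_code}': print only the HTTP status code.
    code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1")"
    [ "$code" = "200" ]
}

# Usage:
#   spark_ui_is_up http://10.10.0.52:8080/ && echo "Spark master UI is up"
```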

----
Getting ready for Step 2
----

With the Spark cluster and HDFS up and running, you are now ready to
proceed to Step 2, running the JSON processing code.

There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user. Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

Install subversion and maven using the 'vagrant' user's sudo ability:

vagrant ssh master
sudo apt-get install subversion
sudo apt-get install maven
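To confirm that both tools ended up on the PATH before moving on, a
small check along these lines can be used. This is only a sketch, and
'missing_tools' is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each given command that cannot
# be found on the PATH. Prints nothing if all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (on 'master'):
#   missing_tools svn mvn
```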

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

sudo su - htrc

svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

cd solr-extracted-features
less README.txt


source: other-projects/hathitrust/wcsa/extracted-features-solr/trunk/vagrant-spark-hdfs-cluster/NOTES-AND-SOURCES.txt
