Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments processing HTRC Extracted Feature
JSON files suitable for ingesting into Solr.

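As a concrete (purely illustrative) sketch of what "suitable for ingesting
into Solr" can mean, the snippet below flattens one Extracted Feature volume
into one Solr-style document per page. The field names used here ("id",
"metadata", "features", "pages", "tokenCount") follow the EF JSON layout as
understood at the time of writing, and should be checked against the actual
files; the Solr field suffixes (_s, _t, _i) are assumptions, not part of the
EF format.

```python
import json

def volume_to_solr_docs(ef_json_text):
    # Parse one Extracted Feature volume and emit one dict per page,
    # shaped like a Solr document (dynamic-field style names assumed).
    vol = json.loads(ef_json_text)
    docs = []
    for seq, page in enumerate(vol.get("features", {}).get("pages", [])):
        docs.append({
            "id": "%s-page-%04d" % (vol.get("id", "unknown"), seq),
            "volume_id_s": vol.get("id"),
            "title_t": vol.get("metadata", {}).get("title"),
            "token_count_i": page.get("tokenCount", 0),
        })
    return docs

# Tiny inline stand-in for a real EF file:
sample = json.dumps({
    "id": "mdp.39015012345678",
    "metadata": {"title": "Example Volume"},
    "features": {"pages": [{"tokenCount": 120}, {"tokenCount": 98}]},
})
print(volume_to_solr_docs(sample))
```

In the real pipeline this mapping would of course run inside the Spark job
over files read from HDFS, rather than in plain Python.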
To aid parallelism, the code is designed to read JSON files from HDFS, so
provisioning of the cluster includes Hadoop core in addition to Spark.

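For example, once HDFS is up (see the start-up steps below), the JSON files
might be copied in along these lines -- the paths here are illustrative,
not fixed by the provisioning:

  hdfs dfs -mkdir -p /user/ubuntu/ef-json
  hdfs dfs -put json-files/*.json /user/ubuntu/ef-json/
  hdfs dfs -ls /user/ubuntu/ef-json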
Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop. Spark is then added on top of that.

http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

  vagrant up

Then log in to the master node, and switch to the 'ubuntu' user:

  vagrant ssh master
  sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

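As a quick sanity check that the daemons came up (assuming a JDK 'jps' on
the PATH), list the running Java processes; on the master you would expect
to see NameNode and the Spark Master, and on workers DataNode and Worker:

  jps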
You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/

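With the cluster up, jobs can then be submitted against the standalone
master. The 7077 port below is Spark's default standalone master port (the
8080 page above is only the monitoring UI); the class, jar, and HDFS path
are placeholders for whatever job is being run:

  spark-submit --master spark://10.10.0.52:7077 \
    --class org.example.ProcessEF my-ef-job.jar \
    hdfs:///user/ubuntu/ef-json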
Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but it looks like a fairly old version of the software (currently unused).

---|
73 | ----
|
---|
74 | Vagrant
|
---|
75 | ----
|
---|
76 |
|
---|
77 | To get rid of 'Guest Additions' warnins (about potentially
|
---|
78 | incompatible version numbers) use 'vbguest' plugin:
|
---|
79 |
|
---|
80 | vagrant plugin install vagrant-vbguest
|
---|
81 |
|
---|
82 | For more details see:
|
---|
83 |
|
---|
84 | http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/
|
---|
85 |
|
---|
----
SecondaryNameNode
----

http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

For example, in hdfs-site.xml:

  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
  </property>

----
Spark Cluster
----

http://spark.apache.org/docs/latest/spark-standalone.html