root/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt @ 30913

Revision 30913, 2.5 KB (checked in by davidb, 4 years ago)

Renaming to better represent what the cluster is designed for

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments processing HTRC Extracted Features
JSON files suitable for ingesting into Solr.

To aid parallelism, the code is designed to read the JSON files from
HDFS, so provisioning the cluster includes Hadoop core in addition to
Spark.

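Once the cluster is running, the JSON files need to be copied into HDFS
before Spark can read them.  A rough sketch (the local directory
'json-files' and the HDFS path '/user/ubuntu/json-files' are
illustrative names, not ones this project fixes):

```shell
# Illustrative only: copy a local directory of Extracted Features JSON
# files into HDFS so the Spark workers can read them in parallel.
# Both paths below are example names, not mandated by this project.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /user/ubuntu/json-files
  hdfs dfs -put json-files/*.json /user/ubuntu/json-files/
  hdfs dfs -ls /user/ubuntu/json-files
else
  echo "hdfs not on PATH; run this on a cluster node"
fi
```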
Provisioning uses Puppet scripting, based on the following online
resources, but updated to use newer versions of Ubuntu, Java, and
Hadoop.  Spark is then added on top of that.

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

  https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

  vagrant up

Then log in to the master node, and switch to the 'ubuntu' user:

  vagrant ssh master
  sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

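One quick way to check that the daemons actually came up (assuming
'jps' from the JDK is on the PATH; the exact process names can vary
between Hadoop and Spark versions):

```shell
# Illustrative check: list the Java processes on the master node and
# look for the HDFS and Spark daemons.  The names assume Hadoop's
# NameNode/SecondaryNameNode and Spark standalone's Master.
for proc in NameNode SecondaryNameNode Master; do
  if jps 2>/dev/null | grep -q "$proc"; then
    echo "$proc: running"
  else
    echo "$proc: not running"
  fi
done
```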
You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/
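From the host machine, you can also check that the page responds
without opening a browser (assumes 'curl' is installed; the IP is the
master address above):

```shell
# Illustrative reachability check for the Spark master web UI.
# Prints the HTTP status code, or 'unreachable' if the VM is down.
curl -s -o /dev/null --connect-timeout 3 \
     -w "%{http_code}\n" http://10.10.0.52:8080/ || echo "unreachable"
```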


Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer software versions:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but it appears to use a fairly old version of the software (currently unused).

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers), use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest
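To confirm the plugin took effect (run on the host machine, not inside
a VM):

```shell
# Illustrative check that the vbguest plugin is installed on the host.
if command -v vagrant >/dev/null 2>&1; then
  vagrant plugin list | grep vbguest || echo "vbguest not installed"
else
  echo "vagrant not found on this machine"
fi
```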

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
Secondary NameNode
----

http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

The following property, added to hdfs-site.xml, controls which node the
secondary NameNode runs on, for example:

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
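
With that property in place on the chosen node, the daemon can then be
started there by hand.  A sketch for Hadoop 2.x (newer releases use
'hdfs --daemon start secondarynamenode' instead):

```shell
# Illustrative only: start the secondary NameNode on the node named in
# dfs.namenode.secondary.http-address (Hadoop 2.x per-node script).
if command -v hadoop-daemon.sh >/dev/null 2>&1; then
  hadoop-daemon.sh start secondarynamenode
else
  echo "hadoop-daemon.sh not on PATH; run this on the secondary node"
fi
```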

----
Spark Cluster
----

http://spark.apache.org/docs/latest/spark-standalone.html