Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments processing HTRC Extracted Feature
JSON files suitable for ingesting into Solr.

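As a concrete (purely illustrative) sketch of what "suitable for ingesting
into Solr" can mean, the snippet below flattens one Extracted Feature volume
into one Solr-style document per page. The field names used here ("id",
"metadata", "features", "pages", "tokenCount") follow the EF JSON layout as
understood at the time of writing, and should be checked against the actual
files; the Solr field suffixes (_s, _t, _i) are assumptions, not part of the
EF format.

```python
import json

def volume_to_solr_docs(ef_json_text):
    # Parse one Extracted Feature volume and emit one dict per page,
    # shaped like a Solr document (dynamic-field style names assumed).
    vol = json.loads(ef_json_text)
    docs = []
    for seq, page in enumerate(vol.get("features", {}).get("pages", [])):
        docs.append({
            "id": "%s-page-%04d" % (vol.get("id", "unknown"), seq),
            "volume_id_s": vol.get("id"),
            "title_t": vol.get("metadata", {}).get("title"),
            "token_count_i": page.get("tokenCount", 0),
        })
    return docs

# Tiny inline stand-in for a real EF file:
sample = json.dumps({
    "id": "mdp.39015012345678",
    "metadata": {"title": "Example Volume"},
    "features": {"pages": [{"tokenCount": 120}, {"tokenCount": 98}]},
})
print(volume_to_solr_docs(sample))
```

In the real pipeline this mapping would of course run inside the Spark job
over files read from HDFS, rather than in plain Python.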
To aid parallelism, the code is designed to read JSON files from HDFS, so
provisioning of the cluster includes Hadoop core in addition to Spark.

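For example, once HDFS is up (see the start-up steps below), the JSON files
might be copied in along these lines -- the paths here are illustrative,
not fixed by the provisioning:

  hdfs dfs -mkdir -p /user/ubuntu/ef-json
  hdfs dfs -put json-files/*.json /user/ubuntu/ef-json/
  hdfs dfs -ls /user/ubuntu/ef-json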
Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop. Spark is then added on top of that.

http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

  vagrant up

Then log in to the master node, and switch to the 'ubuntu' user:

  vagrant ssh master
  sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

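As a quick sanity check that the daemons came up (assuming a JDK 'jps' on
the PATH), list the running Java processes; on the master you would expect
to see NameNode and the Spark Master, and on workers DataNode and Worker:

  jps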
You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/

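With the cluster up, jobs can then be submitted against the standalone
master. The 7077 port below is Spark's default standalone master port (the
8080 page above is only the monitoring UI); the class, jar, and HDFS path
are placeholders for whatever job is being run:

  spark-submit --master spark://10.10.0.52:7077 \
    --class org.example.ProcessEF my-ef-job.jar \
    hdfs:///user/ubuntu/ef-json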
Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but it looks like a fairly old version of the software (currently unused).

---|
73 | ----
|
---|
74 | Vagrant
|
---|
75 | ----
|
---|
76 |
|
---|
77 | To get rid of 'Guest Additions' warnins (about potentially
|
---|
78 | incompatible version numbers) use 'vbguest' plugin:
|
---|
79 |
|
---|
80 | vagrant plugin install vagrant-vbguest
|
---|
81 |
|
---|
82 | For more details see:
|
---|
83 |
|
---|
84 | http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/
|
---|
85 |
|
---|
----
SecondaryNameNode
----

http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

For example, in hdfs-site.xml:

  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
  </property>

----
Spark Cluster
----

http://spark.apache.org/docs/latest/spark-standalone.html