Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments processing HTRC Extracted Features
JSON files suitable for ingesting into Solr.

To aid parallelism, the code is designed to read JSON files from HDFS,
so the provisioning of the cluster includes Hadoop core in addition to
Spark.
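
To give a flavour of the processing step, the sketch below flattens one
page-like record into a flat document suitable for posting to Solr's JSON
update handler. It is deliberately simplified: the field names, the record
shape, and the volume ID are illustrative assumptions, not the real HTRC
Extracted Features schema.

```python
import json

def page_to_solr_doc(volume_id, page):
    """Flatten one page record into a flat dict for Solr ingest.
    NOTE: field names and structure are illustrative, not the
    actual HTRC Extracted Features schema."""
    body = page.get("body", {})
    return {
        # One Solr document per page, keyed by volume + page sequence.
        "id": "%s-page-%04d" % (volume_id, page["seq"]),
        "volume_id_s": volume_id,
        "token_count_i": body.get("tokenCount", 0),
        # Collapse the per-page vocabulary into a searchable text field.
        "tokens_t": " ".join(sorted(body.get("tokenPosCount", {}).keys())),
    }

# Toy input, standing in for one page of an EF JSON file read from HDFS.
raw = json.loads("""
{
  "seq": 7,
  "body": {
    "tokenCount": 3,
    "tokenPosCount": {"cat": {"NN": 2}, "sat": {"VBD": 1}}
  }
}
""")
doc = page_to_solr_doc("mdp.39015012345678", raw)
print(doc["id"])            # mdp.39015012345678-page-0007
print(doc["token_count_i"]) # 3
```

In the real pipeline the same per-page function can be mapped over an RDD
of JSON records read from HDFS, which is where the parallelism comes from.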

Provisioning uses Puppet scripting, based on the following online
resources, but updated to use newer versions of Ubuntu, Java, and
Hadoop. Spark is then added on top of that:

http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

vagrant up

Then log in to the master node, and switch to the 'ubuntu' user:

vagrant ssh master
sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

hdfs namenode -format

Otherwise start up the HDFS and Spark daemon processes:

start-dfs.sh
spark-start-all.sh

You can visit the Spark cluster monitoring page at:

http://10.10.0.52:8080/


Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but it looks like a fairly old version of the software (currently unused).

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers) use the 'vbguest' plugin:

vagrant plugin install vagrant-vbguest

For more details see:

http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
SecondaryNode
----

http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
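
The property above belongs in hdfs-site.xml on the namenode. A minimal
sketch adapted to this cluster, assuming the secondary runs on the backup
node (the 'backup' hostname is an assumption about this Vagrant setup, in
place of the EC2 hostname from the Stack Overflow answer):

```xml
<!-- hdfs-site.xml on the master/namenode; hostname is illustrative -->
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>backup:50090</value>
  </property>
</configuration>
```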

----
Spark Cluster
----

http://spark.apache.org/docs/latest/spark-standalone.html
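
For reference, a standalone Spark cluster of this shape is typically
described by a conf/slaves file on the master listing one worker hostname
per line, which the start/stop scripts then iterate over via ssh (the
hostnames below are illustrative, matching a master + 3 slaves layout):

```
# conf/slaves on the master node (hostnames are illustrative)
slave1
slave2
slave3
```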