Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments processing HTRC Extracted Feature
JSON files suitable for ingesting into Solr.

To aid parallelism, the code is designed to read JSON files from HDFS,
so the provisioning of the cluster includes Hadoop core in addition to
Spark.

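As an illustrative sketch only (the HDFS directory and the bzip2'd
filenames below are hypothetical, not something the provisioning
creates), the JSON files can be copied into HDFS from the master node
with the standard Hadoop commands once the cluster is running:

 # create a directory in HDFS and copy the Extracted Feature files in
 hdfs dfs -mkdir -p /user/ubuntu/htrc-ef-json
 hdfs dfs -put *.json.bz2 /user/ubuntu/htrc-ef-json/
 # check the files arrived
 hdfs dfs -ls /user/ubuntu/htrc-ef-json
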
Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop. Spark is then added in on top of that.

 http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

 https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

 vagrant up

Then log in to the master node, and switch to the 'ubuntu' user:

 vagrant ssh master
 sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

 hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

 start-dfs.sh
 spark-start-all.sh

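As a quick sanity check (these commands come from the standard Hadoop
and JDK installs, not from the provisioning scripts), you can confirm
the daemons came up:

 # list the running Java daemons (NameNode, DataNode, Master, Worker, ...)
 jps
 # report the DataNodes that have registered with HDFS
 hdfs dfsadmin -report
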
You can visit the Spark cluster monitoring page at:

 http://10.10.0.52:8080/


Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

 http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
 http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

 https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
 https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

 http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

 https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers), use the 'vbguest' plugin:

 vagrant plugin install vagrant-vbguest

For more details see:

 http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
SecondaryNode
----

 http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

To run the SecondaryNameNode on a node other than the master, set the
following in hdfs-site.xml (the hostname shown is the example used in
the answer above):

<property>
 <name>dfs.namenode.secondary.http-address</name>
 <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>

----
Spark Cluster
----

 http://spark.apache.org/docs/latest/spark-standalone.html
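
Assuming the default standalone master port (7077) and the master
address given earlier (10.10.0.52), jobs can be pointed at this
cluster in the usual way; the class and jar names below are
placeholders:

 # interactive shell against the standalone master
 spark-shell --master spark://10.10.0.52:7077
 # or submit a packaged application
 spark-submit --master spark://10.10.0.52:7077 \
   --class org.example.MyApp my-app.jar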