Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments processing HTRC Extracted Feature
JSON files suitable for ingesting into Solr.

To aid parallelism, the code is designed to read JSON files from HDFS,
so the provisioning of the cluster includes Hadoop core in addition to
Spark.

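As an illustrative sketch only (the HDFS directory and the bzip2'd
filenames below are hypothetical, not something the provisioning
creates), the JSON files can be copied into HDFS from the master node
with the standard Hadoop commands once the cluster is running:

 # create a directory in HDFS and copy the Extracted Feature files in
 hdfs dfs -mkdir -p /user/ubuntu/htrc-ef-json
 hdfs dfs -put *.json.bz2 /user/ubuntu/htrc-ef-json/
 # check the files arrived
 hdfs dfs -ls /user/ubuntu/htrc-ef-json
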
Provisioning uses Puppet scripting, based on the following on-line
resources, but updated to use newer versions of Ubuntu, Java,
and Hadoop. Spark is then added in on top of that.

 http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

 https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

 vagrant up

Then log in to the master node, and switch to the 'ubuntu' user:

 vagrant ssh master
 sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

 hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

 start-dfs.sh
 spark-start-all.sh

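As a quick sanity check (these commands come from the standard Hadoop
and JDK installs, not from the provisioning scripts), you can confirm
the daemons came up:

 # list the running Java daemons (NameNode, DataNode, Master, Worker, ...)
 jps
 # report the DataNodes that have registered with HDFS
 hdfs dfsadmin -report
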
You can visit the Spark cluster monitoring page at:

 http://10.10.0.52:8080/


Supporting Resources
====================

----
Basic Hadoop Cluster
----

For useful documentation about setting up a Hadoop cluster, read:

 http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
 http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

 https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
 https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

 http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

 https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).

----
Vagrant
----

To get rid of 'Guest Additions' warnings (about potentially
incompatible version numbers), use the 'vbguest' plugin:

 vagrant plugin install vagrant-vbguest

For more details see:

 http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
SecondaryNode
----

 http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

To run the SecondaryNameNode on a node other than the master, set the
following in hdfs-site.xml (the hostname shown is the example used in
the answer above):

<property>
 <name>dfs.namenode.secondary.http-address</name>
 <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>

----
Spark Cluster
----

 http://spark.apache.org/docs/latest/spark-standalone.html
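
Assuming the default standalone master port (7077) and the master
address given earlier (10.10.0.52), jobs can be pointed at this
cluster in the usual way; the class and jar names below are
placeholders:

 # interactive shell against the standalone master
 spark-shell --master spark://10.10.0.52:7077
 # or submit a packaged application
 spark-submit --master spark://10.10.0.52:7077 \
   --class org.example.MyApp my-app.jar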