Vagrant provisioning files to spin up a modest Spark cluster (master + 3 slaves + backup) for experiments processing HTRC Extracted Features JSON files suitable for ingesting into Solr. To aid parallelism, the code is designed to read the JSON files from HDFS, so provisioning the cluster includes Hadoop core in addition to Spark.

Provisioning uses Puppet scripting, based on the following online resources, but updated to use newer versions of Ubuntu, Java, and Hadoop. Spark is then added in on top of that:

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
  https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

  vagrant up

Then log in to the master node and switch to the 'ubuntu' user:

  vagrant ssh master
  sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/

Supporting Resources
====================

---- Basic Hadoop Cluster ----

For useful documentation about setting up a Hadoop cluster, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

---- Hadoop + Apache Ambari in 3 lines ----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).
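---- Loading JSON files into HDFS ----

Once the HDFS daemons are running on the master node, the Extracted Features JSON files can be copied into HDFS with the standard 'hdfs dfs' commands. A minimal sketch (the /user/ubuntu/htrc-ef path and the local json/ directory are illustrative, not fixed by this setup):

```shell
# Create a directory in HDFS to hold the Extracted Features files
# (path is illustrative -- choose whatever layout your Spark code expects)
hdfs dfs -mkdir -p /user/ubuntu/htrc-ef

# Copy local JSON files into that HDFS directory
hdfs dfs -put json/*.json /user/ubuntu/htrc-ef/

# Confirm the files arrived
hdfs dfs -ls /user/ubuntu/htrc-ef
```

The Spark jobs can then point at hdfs:// paths under this directory rather than local files, which is what allows the worker nodes to read the data in parallel.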
---- Vagrant ----

To get rid of 'Guest Additions' warnings (about potentially incompatible version numbers), use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

---- SecondaryNode ----

  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

  dfs.namenode.secondary.http-address
  ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090

---- Spark Cluster ----

  http://spark.apache.org/docs/latest/spark-standalone.html
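The dfs.namenode.secondary.http-address setting noted under SecondaryNode above would normally go in hdfs-site.xml on the node that should run the secondary namenode. A sketch of the property block (the EC2 hostname is the one from the linked Stack Overflow example, not a host in this cluster):

```xml
<!-- hdfs-site.xml: run the secondary namenode on a specific host -->
<!-- (hostname below is from the linked example; substitute your own node) -->
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
```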