Vagrant provisioning files to spin up a modest Spark cluster (master + 3 slaves + backup) for experiments processing HTRC Extracted Features JSON files suitable for ingesting into Solr. To aid parallelism, the code is designed to read the JSON files from HDFS, so provisioning the cluster includes Hadoop core in addition to Spark.

Provisioning uses Puppet scripting, based on the following online resources, but updated to use newer versions of Ubuntu, Java, and Hadoop. Spark is then added in on top of that:

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
  https://github.com/calo81/vagrant-hadoop-cluster

To get everything set up, type:

  vagrant up

Then log in to the master node and switch to the 'ubuntu' user:

  vagrant ssh master
  sudo su - ubuntu

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/

Supporting Resources
====================

---- Basic Hadoop Cluster ----

For useful documentation about setting up a Hadoop cluster, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

---- Hadoop + Apache Ambari in 3 lines ----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).
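---- Loading JSON files into HDFS ----

Once the HDFS daemons are running on the master node, the Extracted Features JSON files can be copied into HDFS with the standard 'hdfs dfs' commands. A minimal sketch (the /user/ubuntu/htrc-ef path and the local json/ directory are illustrative, not fixed by this setup):

```shell
# Create a directory in HDFS to hold the Extracted Features files
# (path is illustrative -- choose whatever layout your Spark code expects)
hdfs dfs -mkdir -p /user/ubuntu/htrc-ef

# Copy local JSON files into that HDFS directory
hdfs dfs -put json/*.json /user/ubuntu/htrc-ef/

# Confirm the files arrived
hdfs dfs -ls /user/ubuntu/htrc-ef
```

The Spark jobs can then point at hdfs:// paths under this directory rather than local files, which is what allows the worker nodes to read the data in parallel.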
---- Vagrant ----

To get rid of 'Guest Additions' warnings (about potentially incompatible version numbers), use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

---- SecondaryNode ----

  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

  dfs.namenode.secondary.http-address
  ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090

---- Spark Cluster ----

  http://spark.apache.org/docs/latest/spark-standalone.html
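The dfs.namenode.secondary.http-address setting noted under SecondaryNode above would normally go in hdfs-site.xml on the node that should run the secondary namenode. A sketch of the property block (the EC2 hostname is the one from the linked Stack Overflow example, not a host in this cluster):

```xml
<!-- hdfs-site.xml: run the secondary namenode on a specific host -->
<!-- (hostname below is from the linked example; substitute your own node) -->
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
```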