In learning about running Spark and Hadoop on a cluster, the following resources were found to be useful.

---- Setting up a Cluster using Vagrant and Puppet ----

Provisioning uses Puppet scripting, based on the following online resources, but updated to use newer versions of Ubuntu, Java, and Hadoop. Spark is then added on top of that.

  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
  https://github.com/calo81/vagrant-hadoop-cluster

---- Basic Hadoop Cluster setup manually ----

For useful documentation on setting up a Hadoop cluster, read:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

or alternatively:

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

---- Hadoop + Apache Ambari in 3 lines ----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).

---- Vagrant ----

To get rid of 'Guest Additions' warnings (about potentially incompatible version numbers), use the 'vbguest' plugin:

  vagrant plugin install vagrant-vbguest

For more details see:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

---- SecondaryNameNode ----

  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

In hdfs-site.xml:

  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
  </property>

---- Spark Cluster ----

  http://spark.apache.org/docs/latest/spark-standalone.html

---- Introduction ----

Vagrant provisioning files to spin up a modest Spark cluster (master + 3 slaves + backup) for experiments in processing HTRC Extracted Features JSON files in parallel, suitable for ingesting into Solr.

*Assumptions*: You have VirtualBox and Vagrant installed.

This is a 2-step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code that processes the JSON files

Step 1 is covered by this README file, ending with an svn checkout of the Java code on the 'master' node that processes the JSON files. The files checked out include the README file covering Step 2.

---- Step 1 ----

From within the directory this README.txt is located in, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to do. Subsequent use of this command spins the cluster up much faster.

Once the cluster is set up, you need to get the Spark framework up and running, which in turn uses Hadoop's HDFS. You do this as the user 'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/
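Before moving on, it can be worth confirming that the daemons actually started. This is only a suggested sanity check (it is not part of the provisioning scripts above), using the stock JDK and Hadoop tools while still logged in as 'htrc' on 'master':

  jps
  hdfs dfsadmin -report

'jps' lists the Java daemons running on the node (you would expect to see, for example, NameNode and Spark's Master here), while 'hdfs dfsadmin -report' summarises HDFS capacity and shows the DataNodes on the slave nodes reporting in.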
---- Getting ready for Step 2 ----

With the Spark cluster and HDFS up and running, you are now ready to proceed to Step 2, running the JSON processing code. There are a couple of packages the 'master' node needs for this ('svn' and 'mvn'), which we install as the 'vagrant' user. Then we are in a position to check out the Java code, which in turn includes the README file for Step 2.

Install subversion and maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

  sudo su - htrc
  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt
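---- Shutting the cluster down ----

When you are finished with the cluster, it can be stopped cleanly. This is only a suggested sequence using stock Hadoop and Vagrant commands: stop Spark's daemons first (the matching stop script for 'spark-start-all.sh' depends on how the provisioning named it; Spark's standalone distribution ships its own sbin/stop-all.sh), then stop HDFS as 'htrc' on 'master':

  stop-dfs.sh

Then, back on the host machine, from the directory containing this README.txt:

  vagrant halt

'vagrant up' brings the same cluster back later; 'vagrant destroy' removes the VMs altogether.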