---- Introduction ----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Feature JSON files in parallel, suitable for ingesting into Solr.

*Assumptions*

  * You have VirtualBox and Vagrant installed
    (at time of writing VirtualBox v5.0.28, Vagrant 1.8.6)

*Useful*

  * Installing the Vagrant VirtualBox Guest Additions plugin to stop
    warnings about potentially incompatible versions:

      vagrant plugin install vagrant-vbguest

---- Setup Procedure ----

This is a 2 step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code to process the JSON files

Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files. The
files checked out include the README file covering Step 2.

---- Step 1 ----

From within the directory where this README.txt is located, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to
do. Subsequent use of this command spins the cluster up much faster.

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS. You do this as the user
'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

In either case, then start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/

---- Getting ready for Step 2 ----

With the Spark cluster and HDFS up and running, you are now ready to
proceed to Step 2, running the JSON processing code.

There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user. Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

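Before moving on, it can help to confirm that the Spark master from
Step 1 is actually answering. The following is only a sketch, not part
of the provisioning files: it polls the monitoring page given above
until it responds. The function name wait_for_url, the retry count, the
delay and the FETCH override are all illustrative assumptions, and it
assumes curl is available on the machine you run it from.

```shell
#!/bin/sh
# Sketch: poll the Spark master monitoring page until it answers.
# SPARK_UI is the address given in this README; everything else here
# (function name, retry defaults, FETCH override) is a hypothetical helper.
SPARK_UI="http://10.10.0.52:8080/"

# Command used to probe a URL. Can be overridden from the environment.
FETCH="${FETCH:-curl -fsS -o /dev/null --max-time 5}"

# wait_for_url URL [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as one probe succeeds, 1 if every attempt fails.
wait_for_url() {
    url=$1
    tries=${2:-10}
    delay=${3:-3}
    i=1
    while [ "$i" -le "$tries" ]; do
        # $FETCH is deliberately unquoted so its options are split into words
        if $FETCH "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

With the function loaded into your shell, a typical call would be:

  wait_for_url "$SPARK_UI" && echo "Spark master is up"
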
Install subversion and maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven

Now switch from the 'vagrant' user to 'htrc' and check out the Java
code:

  sudo su - htrc

  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt

----
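
If the checkout or the later build complains about a missing command,
a quick way to see which of the two tools did not install is a small
check like the one below. This is a sketch only; missing_tools is a
hypothetical helper name, not something provided by the provisioning
files.

```shell
#!/bin/sh
# missing_tools NAME...
# Prints each NAME that cannot be found on PATH, one per line.
# Prints nothing when every tool is present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

On the 'master' node you would call it as:

  missing_tools svn mvn

Empty output means both commands are available.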