Timestamp: 2016-10-25T11:27:10+13:00
Author: davidb
Message: Tidy up of setup description
Location: other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk
Files: 1 added, 1 edited

  • other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

--- README.txt (r30913)
+++ README.txt (r30914)
+
+----
+Introduction
+----
 
 Vagrant provisioning files to spin up a modest Spark cluster (master
-+ 3 slaves + backup) for experiments processing HTRC Extracted Feature
-JSON files suitable for ingesting into Solr.
-
-To aid parallelism, code is designed to read JSON files from HDFS, so
-the provision of the cluster includes Hadoop core in addition to Spark
++ 3 slaves + backup) for experiments in processing HTRC Extracted
+Feature JSON files in parallel, suitable for ingesting into Solr.
 
 
-Provisioning uses Puppet scripting, based on the following on-line
-resources, but updated to use newer versions of Ubuntu, Java,
-and Hadoop.  Spark is then added in on top of that.
+*Assumptions*
+
+  * You have VirtualBox and Vagrant installed
+    (at time of writing VirtualBox v5.0.28, Vagrant 1.8.6)
+
+
+*Useful*
+
+  * Installing the Vagrant VirtualBox Guest Additions plugin to stop warnings
+    about potentially incompatible versions:
+
+      vagrant plugin install vagrant-vbguest
 
 
-  http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
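To confirm the plugin install above took effect, listing installed plugins is a quick check (a hedged aside: the command is standard Vagrant, but this verification step is an assumption, not part of the changeset):

  # vagrant-vbguest should appear in the list of installed plugins
  vagrant plugin list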
+----
+Setup Procedure
+----
 
-  https://github.com/calo81/vagrant-hadoop-cluster
+This is a 2-step process:
 
-To get everything setup, type:
+  Step 1: Setting up the cluster
+  Step 2: Checking out the Java code to process the JSON files
 
-  vargrant up
 
-Then log in to the master node, and swithc to 'ubuntu' user
+Step 1 is covered by this README file, ending with an svn checkout of
+the Java code on the 'master' node that processes the JSON files.  The
+files checked out include the README file covering Step 2.
+
+----
+Step 1
+----
 
-  vargrant ssh master
-  sudo su - ubuntu
+From within the directory this README.txt is located in, enter:
+
+  vagrant up
+
+The first time this is run, there is a lot of downloading and setup to
+do.  Subsequent use of this command spins the cluster up much faster.
+
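At this point it is worth confirming that all of the cluster's VMs came up (a hedged aside: 'vagrant status' is a standard command, but this check is an assumption, not part of the changeset):

  # Every VM defined in the Vagrantfile should be listed as 'running'
  vagrant status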
+Once the cluster is set up, you need to get the Spark framework up and
+running, which in turn uses Hadoop's HDFS.  You do this as the user
+'htrc' on the 'master' node:
+
+  vagrant ssh master
+  sudo su - htrc
 
 If this is the first time, you need to format an HDFS area to use:
+
   hdfs namenode -format
 
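The elided lines that follow presumably cover starting the daemons; with stock Hadoop and Spark 'sbin' scripts the sequence would look roughly like this (an assumed sketch, not taken from this changeset):

  # Assumed start-up using the standard scripts; install paths may differ
  start-dfs.sh                    # HDFS namenode and datanodes
  $SPARK_HOME/sbin/start-all.sh   # Spark standalone master and workers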
…
   http://10.10.0.52:8080/
 
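With the daemons up, the web UI above should show the registered workers; from the shell, standard HDFS commands give a similar check (again an assumed verification, not from this README):

  # Both should succeed once HDFS is up and the datanodes have registered
  hdfs dfsadmin -report
  hdfs dfs -ls /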
+----
+Getting ready for Step 2
+----
+
+With the Spark cluster and HDFS up and running, you are now ready to
+proceed to Step 2, running the JSON processing code.
 
 
-Supporting Resources
-====================
+There are a couple of packages the 'master' node needs for this ('svn'
+and 'mvn'), which we install as the 'vagrant' user.  Then we are in a
+position to check out the Java code, which in turn includes the README
+file for Step 2.
+
+Install subversion and maven using the 'vagrant' user's sudo ability:
+
+  vagrant ssh master
+  sudo apt-get install subversion
+  sudo apt-get install maven
+
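A quick way to confirm both packages installed cleanly (a hedged aside; the version flags are standard, but this step is not part of the changeset):

  # Each command prints version details if the install succeeded
  svn --version
  mvn -version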
+Now switch from the 'vagrant' user to 'htrc' and check out the Java code:
+
+  sudo su - htrc
+
+  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features
+
+Now follow the README file for Step 2:
+
+  cd solr-extracted-features
+  less README.txt
 
 ----
-Basic Hadoop Cluster
-----
 
-Useful documentation about setting up a Hadoop cluster, read:
 
-  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
-then
-  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html
 
-OR
 
-  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
-then
-  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html
 
-For working with newer Linux OS and version of software:
-
-  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
-
-----
-Hadoop + Apache Ambari in 3 lines:
-----
-
-  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/
-
-but looks like a fairly old version of software (currently unused).
-
-----
-Vagrant
-----
-
-To get rid of 'Guest Additions' warnins (about potentially
-incompatible version numbers) use 'vbguest' plugin:
-
-  vagrant plugin install vagrant-vbguest
-
-For more details see:
-
-  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/
-
-----
-SecondaryNode
-----
-
-http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node
-
-<property>
-  <name>dfs.namenode.secondary.http-address</name>
-  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
-</property>
-
-----
-Spark Cluster
-----
-
-http://spark.apache.org/docs/latest/spark-standalone.html