Timestamp: 2016-10-25T11:27:10+13:00
File: 1 edited
other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt
r30913 → r30914

----
Introduction
----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Feature JSON files in parallel, suitable for ingesting into Solr.


*Assumptions*

* You have VirtualBox and Vagrant installed
  (at time of writing VirtualBox v5.0.28, Vagrant 1.8.6)


*Useful*

* Installing the Vagrant VirtualBox Guest Additions plugin to stop warnings
  about potentially incompatible versions:

    vagrant plugin install vagrant-vbguest


----
Setup Procedure
----

This is a 2-step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code to process the JSON files

Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files. The
files checked out include the README file covering Step 2.

----
Step 1
----

From within the directory where this README.txt is located, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to
do. Subsequent use of this command spins the cluster up much faster.

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS. You do this as the user
'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

…

  http://10.10.0.52:8080/
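At this point it can be worth sanity-checking the cluster from the
command line. The following is a minimal sketch, not part of the setup
proper: it assumes you are still the 'htrc' user on 'master', that the
Hadoop binaries are on the PATH (as they are for the commands above),
and that curl is available on the VM.

  # Ask the namenode for a cluster report; if HDFS is up, the
  # worker nodes appear here as live datanodes:
  hdfs dfsadmin -report

  # Confirm the Spark master's web UI (the URL above) is responding:
  curl http://10.10.0.52:8080/

If no live datanodes are listed in the report, HDFS is not fully up yet.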
----
Getting ready for Step 2
----

With the Spark cluster and HDFS up and running, you are now ready to
proceed to Step 2, running the JSON processing code.

There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user. Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

Install subversion and maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

  sudo su - htrc

  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt
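Before diving into that README, it can be worth confirming the
toolchain and checkout are usable. A minimal sketch, assuming the
project follows standard Maven conventions; the Step 2 README remains
the authoritative source for the actual build and run instructions:

  # Verify the tools installed above are on the PATH:
  svn --version --quiet
  mvn -v

  # A typical first build for a Maven project; the goals this
  # project actually uses are described in the Step 2 README:
  cd solr-extracted-features
  mvn package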
----
Removed in r30914
----

The previous revision (r30913) also contained the following material,
deleted by this changeset.

From the introduction, the rationale for including Hadoop and the
origins of the provisioning:

  To aid parallelism, code is designed to read JSON files from HDFS, so
  the provision of the cluster includes Hadoop core in addition to Spark.

  Provisioning uses Puppet scripting, based on the following on-line
  resources, but updated to use newer versions of Ubuntu, Java,
  and Hadoop. Spark is then added in on top of that.

    http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html
    https://github.com/calo81/vagrant-hadoop-cluster

The old setup instructions, which logged in to the master node as the
'ubuntu' user rather than 'htrc':

  vagrant ssh master
  sudo su - ubuntu

The old "Supporting Resources" appendix:

----
Basic Hadoop Cluster
----

Useful documentation about setting up a Hadoop cluster:

  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-single-node-cluster-setup-on.html
then
  http://chaalpritam.blogspot.co.nz/2015/05/hadoop-270-multi-node-cluster-setup-on.html

OR

  https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
then
  https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

For working with a newer Linux OS and newer versions of the software:

  http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

----
Hadoop + Apache Ambari in 3 lines
----

  https://blog.codecentric.de/en/2014/04/hadoop-cluster-automation/

but this looks like a fairly old version of the software (currently unused).

----
Vagrant
----

More detail on the vagrant-vbguest plugin mentioned under *Useful* above:

  http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

----
SecondaryNode
----

How to run the secondary namenode on a different node, configured via
this property in hdfs-site.xml:

  http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node

  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
  </property>

----
Spark Cluster
----

Documentation for Spark's standalone cluster mode:

  http://spark.apache.org/docs/latest/spark-standalone.html
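For orientation, the standalone mode documented at that link comes down
to a pair of launcher scripts. A minimal sketch, assuming $SPARK_HOME
points at a Spark installation and that the master listens on Spark's
default port 7077 at this cluster's address (both assumptions, not
confirmed by this changeset):

  # Start a standalone Spark master on this machine
  # (its web UI defaults to port 8080, as used above):
  $SPARK_HOME/sbin/start-master.sh

  # Start a worker on another machine and register it with the master:
  $SPARK_HOME/sbin/start-slave.sh spark://10.10.0.52:7077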