- Timestamp: 2016-10-25T10:02:58+13:00
- Location: other-projects/hathitrust/vagrant-spark-hdfs-cluster
- Files: 1 edited, 1 moved
Legend:
  ' '  Unmodified
  '+'  Added
  '-'  Removed
other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt
r30905 → r30913

-Vagrant provisioning files to spin up a modest (4 node) Hadoop
-cluster for experiments processing HTRC Extracted Feature JSON files
-suitable for ingesting into Solr.
+Vagrant provisioning files to spin up a modest Spark cluster
+(master + 3 slaves + backup) for experiments processing HTRC Extracted
+Feature JSON files suitable for ingesting into Solr.

+To aid parallelism, the code is designed to read the JSON files from
+HDFS, so provisioning the cluster includes Hadoop core in addition to
+Spark.

-The top-level code is Apache Spark, processing HDFS-stored JSON files,
-hence the need for an underlying Hadoop cluster.
-
-Provisioning is based on the following online resources, but updated to
-use newer versions of Ubuntu, Java, and Hadoop.
+Provisioning uses Puppet scripting, based on the following on-line
+resources, but updated to use newer versions of Ubuntu, Java, and
+Hadoop.  Spark is then added on top of that.

 http://cscarioni.blogspot.co.nz/2012/09/setting-up-hadoop-virtual-cluster-with.html

 https://github.com/calo81/vagrant-hadoop-cluster

+To get everything set up, type:
+
+  vagrant up
+
+Then log in to the master node, and switch to the 'ubuntu' user:
+
+  vagrant ssh master
+  sudo su - ubuntu
+
+The first time, you need to format an HDFS area to use:
+
+  hdfs namenode -format
+
+Otherwise, start up the HDFS and Spark daemon processes:
+
+  start-dfs.sh
+  spark-start-all.sh
+
+You can visit the Spark cluster monitoring page at:
+
+  http://10.10.0.52:8080/

…

 http://kvz.io/blog/2013/01/16/vagrant-tip-keep-virtualbox-guest-additions-in-sync/

+----
+SecondaryNode
+----
+
+http://stackoverflow.com/questions/23581425/hadoop-how-to-start-secondary-namenode-on-other-node
+
+<property>
+  <name>dfs.namenode.secondary.http-address</name>
+  <value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
+</property>
+
+----
+Spark Cluster
+----
+
+http://spark.apache.org/docs/latest/spark-standalone.html
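Note: the Spark standalone documentation linked at the end of the README covers
how applications attach to the master. As an illustrative sketch only (it is
not part of this repository), the following PySpark script shows the kind of
job the cluster is intended for: connect to the standalone master, read HTRC
Extracted Feature JSON files out of HDFS, and count them. The master port
(7077), the HDFS address (hdfs://10.10.0.52:9000), the /user/ubuntu/htrc-ef
input directory, the metadata 'title' field, and the count_volumes.py filename
are all assumptions made for the example.

  # count_volumes.py -- a hedged smoke-test sketch, not part of the provisioned files.
  import json
  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("htrc-ef-smoke-test")
          .setMaster("spark://10.10.0.52:7077"))  # assumed standalone master; web UI on :8080
  sc = SparkContext(conf=conf)

  # Each Extracted Feature file holds a single JSON document, so read whole
  # files rather than splitting them into lines.
  files = sc.wholeTextFiles("hdfs://10.10.0.52:9000/user/ubuntu/htrc-ef")
  volumes = files.map(lambda path_text: json.loads(path_text[1]))

  # Pull a metadata field out of each volume as a quick end-to-end check.
  titles = volumes.map(lambda vol: vol.get("metadata", {}).get("title"))

  print("Volumes read from HDFS: %d" % volumes.count())
  print("Sample titles: %s" % titles.take(5))

  sc.stop()

Copy the script to the master node and launch it with spark-submit
(spark-submit count_volumes.py) once start-dfs.sh and spark-start-all.sh
have been run.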