----
Introduction
----

Vagrant provisioning files to spin up a modest Spark cluster (master
+ 3 slaves + backup) for experiments in processing HTRC Extracted
Features JSON files in parallel, suitable for ingesting into Solr.

*Assumptions*

* You have VirtualBox and Vagrant installed
  (at time of writing VirtualBox v5.0.28, Vagrant 1.8.6)
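
If you want to confirm what you have installed, both tools report
their versions from the command line. This is a quick sanity check
only; newer versions than those listed above will usually work, but
are untested here:

```shell
# Print the installed VirtualBox and Vagrant versions so they can be
# compared against the versions this setup was tested with.
VBoxManage --version
vagrant --version
```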

*Useful*

* Installing the Vagrant VirtualBox Guest Additions plugin to stop warnings
  about potentially incompatible versions:

    vagrant plugin install vagrant-vbguest

----
Setup Procedure
----

This is a two-step process:

  Step 1: Setting up the cluster
  Step 2: Checking out the Java code that processes the JSON files

Step 1 is covered by this README file, ending with an svn checkout of
the Java code on the 'master' node that processes the JSON files. The
files checked out include the README file covering Step 2.

----
Step 1
----

From within the directory where this README.txt is located, enter:

  vagrant up

The first time this is run, there is a lot of downloading and setup to
do. Subsequent use of this command spins the cluster up much faster.
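
Once 'vagrant up' returns, you can confirm every VM is running with
Vagrant's status command. (The node names you see, e.g. 'master',
'backup' and the slave nodes, are defined by the Vagrantfile; the
layout described in the Introduction is what is assumed here.)

```shell
# List the state of every VM defined in the Vagrantfile; each node
# should report as 'running' once provisioning has finished.
vagrant status
```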

Once the cluster is set up, you need to get the Spark framework up and
running, which in turn uses Hadoop's HDFS. You do this as the user
'htrc' on the 'master' node:

  vagrant ssh master
  sudo su - htrc

If this is the first time, you need to format an HDFS area to use:

  hdfs namenode -format

Otherwise, start up the HDFS and Spark daemon processes:

  start-dfs.sh
  spark-start-all.sh
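
A quick way to check that the daemons came up is the JDK's 'jps' tool,
run as 'htrc' on 'master'. The exact process names depend on the Hadoop
and Spark versions provisioned, but a healthy master node typically
lists the HDFS NameNode and the Spark Master:

```shell
# List running JVM processes on this node; look for 'NameNode' (HDFS)
# and 'Master' (Spark). Worker nodes would show 'DataNode' and 'Worker'.
jps
```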

You can visit the Spark cluster monitoring page at:

  http://10.10.0.52:8080/
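
If you prefer to check from the command line rather than a browser
(assuming 'curl' is available on the host or in the VM):

```shell
# A 200 response indicates the Spark master web UI is up.
curl -s -o /dev/null -w "%{http_code}\n" http://10.10.0.52:8080/
```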

----
Getting ready for Step 2
----

With the Spark cluster and HDFS up and running, you are now ready to
proceed to Step 2: running the JSON processing code.

There are a couple of packages the 'master' node needs for this ('svn'
and 'mvn'), which we install as the 'vagrant' user. Then we are in a
position to check out the Java code, which in turn includes the README
file for Step 2.

Install Subversion and Maven using the 'vagrant' user's sudo ability:

  vagrant ssh master
  sudo apt-get install subversion
  sudo apt-get install maven
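
Before proceeding, you can verify both packages installed correctly;
each command just prints a version string:

```shell
# Confirm the Subversion and Maven commands are now on the PATH.
svn --version --quiet
mvn -version
```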

Now switch from the 'vagrant' user to 'htrc' and check out the Java code:

  sudo su - htrc

  svn co http://svn.greenstone.org/other-projects/hathitrust/solr-extracted-features/trunk solr-extracted-features

Now follow the README file for Step 2:

  cd solr-extracted-features
  less README.txt

----