----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

----
Step 2
----

1. Check HDFS and Spark Java daemon processes are running:

     jps

   Example output:

     19212 NameNode
     19468 SecondaryNameNode
     19604 Master
     19676 Jps

   [[
   Starting these processes was previously covered in Step 1, but in
   brief: after formatting the disk with

     hdfs namenode -format

   the daemons are started with

     start-dfs.sh
     spark-start-all.sh

   The latter is an alias defined by the Step 1 provisioning, created to
   avoid the conflict over 'start-all.sh', a script name that both Hadoop
   and Spark define.
   ]]
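The daemon check in step 1 can be scripted rather than eyeballed. Below is a minimal sketch; the helper name `check_daemons` is hypothetical, and the daemon list is taken from the example output above:

```shell
# Scan jps-style output for the daemons expected on this cluster's
# head node (names taken from the example output above; the function
# name itself is hypothetical, not part of the provisioning).
check_daemons() {
  _jps_output="$1"
  _missing=0
  for _daemon in NameNode SecondaryNameNode Master; do
    # grep -w avoids 'NameNode' matching inside 'SecondaryNameNode'
    if ! printf '%s\n' "$_jps_output" | grep -qw "$_daemon"; then
      echo "Missing daemon: $_daemon"
      _missing=1
    fi
  done
  return $_missing
}

# Typical usage on the head node:
#   check_daemons "$(jps)" && echo "HDFS and Spark daemons are up"
```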

2. Acquire some JSON files to process, if you have not already done so.
   For example:

     ./scripts/PD-GET-FULL-FILE-LIST.sh
     ./scripts/PD-SELECT-EVERY-10000.sh
     ./scripts/PD-DOWNLOAD-EVERY-10000.sh

3. Push these files over to HDFS:

     hdfs dfs -mkdir /user
     hdfs dfs -mkdir /user/htrc

     hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
     hdfs dfs -put pd-ef-json-files /user/htrc/.

4. Compile the code:

     ./COMPILE.bash

   The first time this is run, a variety of Maven/Java dependencies will
   be downloaded.

5. Run the code on the cluster:

     ./RUN.bash pd-ef-json-filelist-10000.txt

   If running subsequently, remove the saved RDD on HDFS first:

     hdfs dfs -rm -r rdd-solr-json-page-files
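For repeat runs, the cleanup and run commands in step 5 can be combined behind a single guard. A minimal sketch, assuming the file and directory names used in the steps above (the `rerun_job` wrapper itself is hypothetical):

```shell
# Hypothetical wrapper around the two step-5 commands: 'hdfs dfs -test -d'
# exits 0 when the HDFS directory exists, so the saved RDD is only removed
# when a previous run actually left one behind.
rerun_job() {
  if hdfs dfs -test -d rdd-solr-json-page-files; then
    hdfs dfs -rm -r rdd-solr-json-page-files
  fi
  ./RUN.bash pd-ef-json-filelist-10000.txt
}
```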