----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr.  Designed to be used on a Spark cluster with HDFS.
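
Each Extracted Features file is a JSON document describing a single
HathiTrust volume.  As a rough, heavily abbreviated sketch of the kind of
structure involved (field names recalled from the EF dataset documentation,
not taken from this repository):

    {
      "id": "<volume id>",
      "metadata": { "title": "...", ... },
      "features": {
        "pageCount": 100,
        "pages": [ { "seq": "00000001", "tokenCount": 250, ... }, ... ]
      }
    }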

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

  * You have 'svn' and 'mvn' on your PATH
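
  A quick way to confirm both are available (an optional check, not part
  of the original setup):

    svn --version --quiet
    mvn --version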

----
Step 2
----

1. Check that the HDFS and Spark Java daemon processes are running:

    jps

Example output:

    19212 NameNode
    19468 SecondaryNameNode
    19604 Master
    19676 Jps

[[
  Starting these processes was previously covered in Step 1, but in brief,
  after formatting the disk with:

    hdfs namenode -format

  the daemons are started with:

    start-dfs.sh
    spark-start-all.sh

  The latter is an alias defined by the Step 1 provisioning (created to
  avoid the conflict over 'start-all.sh', which both Hadoop and
  Spark define).
]]
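
An optional extra check (not part of the original instructions) that HDFS
is up and reporting capacity:

    hdfs dfsadmin -report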

2. Acquire some JSON files to process, if you have not already done so.
   For example:

    ./scripts/PD-GET-FULL-FILE-LIST.sh
    ./scripts/PD-SELECT-EVERY-10000.sh
    ./scripts/PD-DOWNLOAD-EVERY-10000.sh
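
   As a rough sanity check that the download completed (assuming the
   downloaded files end up in the pd-ef-json-files directory referred to
   in Step 3), count them with:

    find pd-ef-json-files -type f | wc -l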

3. Push these files over to HDFS:

    hdfs dfs -mkdir /user
    hdfs dfs -mkdir /user/htrc

    hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
    hdfs dfs -put pd-ef-json-files /user/htrc/.
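
   To verify the copy (optional, not part of the original instructions):

    hdfs dfs -ls /user/htrc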

4. Compile the code:

    ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.
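
Should you need to invoke the build directly, COMPILE.bash is presumably a
wrapper around a standard Maven build, in which case the equivalent would
be roughly:

    mvn package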

5. Run the code on the cluster:

    ./RUN.bash pd-ef-json-filelist-10000.txt
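
RUN.bash is expected to wrap a spark-submit call; a generic sketch of that
kind of invocation (the class name and jar path below are placeholders, not
the actual values used by this project) looks like:

    spark-submit --master spark://<master-host>:7077 \
      --class org.example.ProcessEFJsonFiles \
      target/solr-extracted-features.jar \
      pd-ef-json-filelist-10000.txt

While the job runs, progress can usually be followed in the Spark master
web UI (port 8080 by default on a standalone cluster).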