source: other-projects/hathitrust/solr-extracted-features/trunk/README.txt@ 30925

Last change on this file since 30925 was 30925, checked in by davidb, 7 years ago

Improved instructions

----
Introduction
----

Java code for processing HTRC Extracted Features JSON files into a form
suitable for ingesting into Solr. Designed to be run on a Spark cluster
with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

  * You have 'svn' and 'mvn' on your PATH
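
The assumption above can be checked before going any further. The
following is a minimal sketch, assuming a POSIX shell; 'check_tools' is
a hypothetical helper name used for illustration and is not part of
this project:

```shell
# Report any of the named tools that cannot be found on PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   check_tools svn mvn
```

If anything is reported as missing, install it (or fix PATH) before
continuing.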

----
Step 2
----

1. Check that the HDFS and Spark Java daemon processes are running:

    jps

Example output:

    19212 NameNode
    19468 SecondaryNameNode
    19604 Master
    19676 Jps

[[
  Starting these processes was previously covered in Step 1, but in brief,
  after formatting the HDFS NameNode with:

    hdfs namenode -format

  the daemons are started with:

    start-dfs.sh
    spark-start-all.sh

  The latter is an alias defined by Step 1 provisioning (created to
  avoid the conflict over 'start-all.sh', which both Hadoop and
  Spark define).
]]
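
The check in step 1 can also be scripted rather than read by eye. The
sketch below assumes a POSIX shell with 'awk' available. The function
name 'missing_daemons' is hypothetical, and the list of expected
daemons is taken only from the example output above, so adjust it if
your cluster runs others:

```shell
# Read jps-style output ("<pid> <name>" per line) on stdin and print
# the name of each expected daemon that does not appear in it.
# The expected list is an assumption based on the example output.
missing_daemons() {
    awk -v expected="NameNode SecondaryNameNode Master" '
        { seen[$2] = 1 }
        END {
            n = split(expected, names, " ")
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) print names[i]
        }'
}

# Usage:
#   jps | missing_daemons
```

No output means every expected daemon was found. Names are compared
exactly, so 'SecondaryNameNode' is not mistaken for 'NameNode'.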

2. Acquire some JSON files to process, if you have not already done so.
   For example:

    ./scripts/PD-GET-FULL-FILE-LIST.sh
    ./scripts/PD-SELECT-EVERY-10000.sh
    ./scripts/PD-DOWNLOAD-EVERY-10000.sh

3. Push these files over to HDFS:

    hdfs dfs -mkdir /user
    hdfs dfs -mkdir /user/htrc

    hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
    hdfs dfs -put pd-ef-json-files /user/htrc/.
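
If step 3 may be run more than once, the plain 'mkdir' and 'put'
commands will complain about paths that already exist. The following is
one possible re-runnable variant, offered as a sketch rather than as
part of this project: 'push_to_hdfs' is a hypothetical helper name, and
it relies on the standard 'hdfs dfs -mkdir -p' and 'hdfs dfs -test -e'
options.

```shell
# Create the target HDFS directory (if needed) and upload each given
# file or directory, skipping anything already present in the target.
push_to_hdfs() {
    target_dir=$1
    shift
    # -p creates parent directories and succeeds if they already exist
    hdfs dfs -mkdir -p "$target_dir" || return 1
    for item in "$@"; do
        name=$(basename "$item")
        if hdfs dfs -test -e "$target_dir/$name"; then
            echo "skipping $name (already in $target_dir)"
        else
            hdfs dfs -put "$item" "$target_dir/." || return 1
        fi
    done
}

# Usage:
#   push_to_hdfs /user/htrc pd-file-listing-step10000.txt pd-ef-json-files
```

Afterwards, 'hdfs dfs -ls /user/htrc' lists what was uploaded.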

4. Compile the code:

    ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.

5. Run the code on the cluster:

    ./RUN.bash pd-ef-json-filelist-10000.txt