source: other-projects/hathitrust/wcsa/extracted-features-solr/trunk/solr-ingest/README.txt@ 31024

Last change on this file since 31024 was 30972, checked in by davidb, 7 years ago

addition of useful command needed before re-running

----
Introduction
----

Java code for processing HTRC Extracted Features JSON files into a form
suitable for ingesting into Solr. Designed to be used on a Spark cluster
with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

  * You have 'svn' and 'mvn' on your PATH

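The assumption above can be checked before starting. The following is a
minimal sketch, not part of this project: the function name
'missing_tools' is hypothetical, and it relies only on the POSIX
'command -v' built-in.

```shell
# Hypothetical helper (not shipped with this project): print the names of
# any tools, given as arguments, that cannot be found on the PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # 'command -v' succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # Print the collected names without the leading space
    printf '%s\n' "${missing# }"
}
```

For example, 'missing_tools svn mvn' prints an empty line when both tools
are available, and otherwise prints the names that were not found.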
----
Step 2
----

1. Check that the HDFS and Spark Java daemon processes are running:

    jps

Example output:

    19212 NameNode
    19468 SecondaryNameNode
    19604 Master
    19676 Jps

[[
  Starting these processes was previously covered in Step 1, but in brief,
  after formatting the disk with:

    hdfs namenode -format

  The daemons are started with:

    start-dfs.sh
    spark-start-all.sh

  The latter is an alias defined by Step 1 provisioning (created to
  avoid the conflict over 'start-all.sh', which both Hadoop and
  Spark define).
]]

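The check in item 1 can also be scripted. The sketch below is an
illustration only: the function name 'missing_daemons' is hypothetical,
and the list of expected daemons is an assumption taken from the example
output above (a given cluster may run others, such as DataNode or Worker).

```shell
# Hypothetical helper: given text in the format printed by 'jps'
# (one "PID Name" pair per line), print the expected daemons that
# are absent from it.
missing_daemons() {
    jps_output="$1"
    missing=""
    # Assumed list, copied from the example output in this README
    for name in NameNode SecondaryNameNode Master; do
        # Match the name as the whole second field, so that
        # 'SecondaryNameNode' is not counted as 'NameNode'
        if ! printf '%s\n' "$jps_output" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            missing="$missing $name"
        fi
    done
    printf '%s\n' "${missing# }"
}
```

Typical use would be 'missing_daemons "$(jps)"', which prints an empty
line when all three daemons are listed.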
2. Acquire some JSON files to process, if you have not already done so.
   For example:

    ./scripts/PD-GET-FULL-FILE-LIST.sh
    ./scripts/PD-SELECT-EVERY-10000.sh
    ./scripts/PD-DOWNLOAD-EVERY-10000.sh

3. Push these files over to HDFS:

    hdfs dfs -mkdir /user
    hdfs dfs -mkdir /user/htrc

    hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
    hdfs dfs -put pd-ef-json-files /user/htrc/.

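If this step may be repeated, the commands above fail once the directory
or files already exist. The sketch below is one possible variant, not
part of this project: the function name 'push_to_hdfs' is hypothetical,
and it assumes that overwriting existing copies on HDFS is acceptable.
It uses the standard 'hdfs dfs' options '-mkdir -p' (create parent
directories, no error if present) and '-put -f' (overwrite the
destination).

```shell
# Hypothetical helper: create the destination directory on HDFS if needed,
# then copy each remaining argument into it, overwriting existing copies.
push_to_hdfs() {
    dest="$1"
    shift
    # -p creates missing parents and does not fail if the directory exists
    hdfs dfs -mkdir -p "$dest" || return 1
    for src in "$@"; do
        # -f overwrites a destination that is already present
        hdfs dfs -put -f "$src" "$dest/" || return 1
    done
}
```

For example:

    push_to_hdfs /user/htrc pd-file-listing-step10000.txt pd-ef-json-files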
4. Compile the code:

    ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.

5. Run the code on the cluster:

    ./RUN.bash pd-ef-json-filelist-10000.txt

Before any subsequent run, remove the saved RDD on HDFS first:

    hdfs dfs -rm -r rdd-solr-json-page-files

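The clean-up before a re-run can be made safe to call even when no saved
RDD is present. The sketch below is an illustration only: the function
name 'clean_saved_rdd' is hypothetical, and the default path is the one
used in the command above. It relies on 'hdfs dfs -test -e', which exits
with status 0 when the path exists.

```shell
# Hypothetical helper: remove the saved RDD directory from HDFS,
# but only if it exists.
clean_saved_rdd() {
    # Default to the directory name used in this README
    rdd_dir="${1:-rdd-solr-json-page-files}"
    if hdfs dfs -test -e "$rdd_dir"; then
        hdfs dfs -rm -r "$rdd_dir"
    fi
}
```

A re-run could then be written as:

    clean_saved_rdd && ./RUN.bash pd-ef-json-filelist-10000.txt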