root/other-projects/hathitrust/solr-extracted-features/trunk/README.txt @ 30915

Revision 30915, 0.8 KB (checked in by davidb, 4 years ago)

Initial cut at instructions to follow to get code set up and running


----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr.  Designed to be used on a Spark cluster with HDFS.

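The Extracted Features schema is not defined in this repository, but as an
orientation aid, a volume-level EF JSON file has roughly the following shape
(field names follow the publicly documented 1.x EF format and may differ in
other releases; all values here are placeholders):

```json
{
  "id": "example-volume-id",
  "metadata": { "comment": "volume-level bibliographic fields go here" },
  "features": {
    "pageCount": 1,
    "pages": [
      {
        "seq": "00000001",
        "body": { "tokenCount": 0, "tokenPosCount": {} }
      }
    ]
  }
}
```
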
----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

  * You have 'svn' and 'mvn' on your PATH

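Before continuing, the assumption above can be checked with a short shell
loop (a convenience sketch, not part of this repository's scripts):

```shell
# Check that the assumed tools are on PATH before continuing.
# Prints one status line per tool and does not abort, so all
# missing tools are reported at once.
for tool in svn mvn; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING -- install it or add it to your PATH"
  fi
done
```
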
----
Step 2
----

Compile the code:

  ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.

Next acquire some JSON files to process.  For example:

  ./scripts/PD-GET-FULL-FILE-LIST.sh
  ./scripts/PD-SELECT-EVERY-10000.sh
  ./scripts/PD-DOWNLOAD-EVERY-10000.sh
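
The internals of these scripts are not shown here, but judging from the
RUN.bash invocation below, their end product is a plain file list
(pd-ef-json-filelist.txt) with one JSON file path per line.  A minimal sketch
of building such a list by hand — using empty placeholder files, not real
Extracted Features data — looks like this (the one-path-per-line format is
an assumption):

```shell
# Assumption: pd-ef-json-filelist.txt is a plain list of JSON file
# paths, one per line.  The files created here are empty placeholders,
# not real Extracted Features data.
mkdir -p ef-json-sample
touch ef-json-sample/vol1.json ef-json-sample/vol2.json
ls ef-json-sample/*.json > pd-ef-json-filelist.txt
wc -l < pd-ef-json-filelist.txt    # number of files the run will process
```
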

Now run the code:

  ./RUN.bash pd-ef-json-filelist.txt