----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr.  Designed to be used on a Spark cluster with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

  * You have 'svn' and 'mvn' on your PATH (a quick check is given below)
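
To confirm both are available:

  which svn mvn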

----
Step 2
----

Compile the code:

  ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.
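
COMPILE.bash is the supported way to build.  If you need to drive the build
directly, it is presumably a standard Maven invocation along these lines (a
sketch only, assuming the usual Maven project layout; check COMPILE.bash
for the real options):

  mvn package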

Next acquire some JSON files to process.  For example:

  ./scripts/PD-GET-FULL-FILE-LIST.sh
  ./scripts/PD-SELECT-EVERY-10000.sh
  ./scripts/PD-DOWNLOAD-EVERY-10000.sh
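
The SELECT script evidently thins the full listing down to every 10,000th
entry.  Conceptually it amounts to something like the following one-liner
(illustrative only: the input name here is an assumption, while the output
name matches the file used in the HDFS steps below):

  awk 'NR % 10000 == 1' pd-full-file-list.txt > pd-file-listing-step10000.txt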

Now run the code:

  ./RUN.bash pd-ef-json-filelist.txt
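
RUN.bash wraps the Spark job submission.  If you need to adjust what gets
submitted, it is presumably a spark-submit call of roughly this shape (a
sketch only; the master URL, class and jar names below are placeholders,
not the project's actual values; see RUN.bash itself):

  spark-submit --master spark://<master-host>:7077 \
      --class <MainClass> target/<project>.jar pd-ef-json-filelist.txt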

If running against the HDFS cluster set up in Step 1, the expected Hadoop
and Spark daemons should already be up; 'jps' on the master node shows
something like:

  % jps
  19468 SecondaryNameNode
  19604 Master
  19676 Jps
  19212 NameNode
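
To further confirm that HDFS itself is healthy (a standard Hadoop command,
not part of this project):

  hdfs dfsadmin -report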

For the cluster to see the data, the file listing and the downloaded JSON
files need to be copied into HDFS, under a home directory for the 'htrc'
user:

  hdfs dfs -mkdir /user
  hdfs dfs -mkdir /user/htrc
  hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
  hdfs dfs -put pd-ef-json-files /user/htrc/.
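
To confirm the copies arrived:

  hdfs dfs -ls /user/htrc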