----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.
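
For orientation, each Extracted Features file is a single JSON document,
roughly of this shape (heavily abbreviated; see the HTRC Extracted Features
documentation for the authoritative field list):

 {
   "id": "<HathiTrust volume id>",
   "metadata": { ...bibliographic fields... },
   "features": { "pageCount": ..., "pages": [ ...per-page token data... ] }
 }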

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

 http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

 * You have 'svn' and 'mvn' on your PATH
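
A quick way to confirm both are present:

 which svn mvn
 svn --version
 mvn --version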

----
Step 2
----

Compile the code:

 ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.
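
Assuming COMPILE.bash wraps a standard Maven build, the equivalent direct
invocation would be something like:

 mvn clean package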

Next acquire some JSON files to process. For example:

 ./scripts/PD-GET-FULL-FILE-LIST.sh
 ./scripts/PD-SELECT-EVERY-10000.sh
 ./scripts/PD-DOWNLOAD-EVERY-10000.sh
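
The SELECT step thins the full listing down to every 10,000th entry.
Assuming the listing is one file path per line, it amounts to something
like the following (file names here are illustrative):

 awk 'NR % 10000 == 1' full-file-list.txt > pd-ef-json-filelist.txt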

Now run the code:

 ./RUN.bash pd-ef-json-filelist.txt
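
RUN.bash presumably wraps spark-submit; a direct submission would look
something like this (the class and jar names are hypothetical stand-ins,
not necessarily the project's actual ones):

 spark-submit --master spark://<master-host>:7077 \
     --class org.hathitrust.ProcessForSolrIngest \
     target/solr-extracted-features-*.jar \
     pd-ef-json-filelist.txt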

To check that the expected Hadoop and Spark daemons are running, use 'jps'.
On the master node the output should look something like:

 % jps
 19468 SecondaryNameNode
 19604 Master
 19676 Jps
 19212 NameNode
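
If any of these are missing, start HDFS and the Spark master first (paths
assume standard Hadoop and Spark installations):

 $HADOOP_HOME/sbin/start-dfs.sh
 $SPARK_HOME/sbin/start-master.sh
 $SPARK_HOME/sbin/start-slaves.sh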

On a fresh cluster, create the HDFS home directory and copy the file
listing into HDFS:

 hdfs dfs -mkdir /user
 hdfs dfs -mkdir /user/htrc
 hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
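
To confirm the copy:

 hdfs dfs -ls /user/htrc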