Last change on this file since 30916 was 30916, checked in by davidb, 8 years ago (Some additional details -- note form)
File size: 1.0 KB
----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

----
Step 2
----

Compile the code:

./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.

Next, acquire some JSON files to process. For example:

./scripts/PD-GET-FULL-FILE-LIST.sh
./scripts/PD-SELECT-EVERY-10000.sh
./scripts/PD-DOWNLOAD-EVERY-10000.sh
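The three helper scripts above fetch the full public-domain file listing, thin it to every 10000th entry, and download the selected files. As a rough illustration of the thinning step only (the awk one-liner and the input name 'pd-full-file-listing.txt' are assumptions, not taken from the scripts themselves; a dummy listing stands in for the real one):

```shell
# Hypothetical sketch of an every-10000th-line selection over a file listing.
# A generated dummy listing stands in for the real PD-GET-FULL-FILE-LIST.sh output.
seq -f "dummy-volume-%g.json.bz2" 25000 > pd-full-file-listing.txt

# Keep every 10000th line (lines 1, 10001, 20001, ...)
awk 'NR % 10000 == 1' pd-full-file-listing.txt > pd-file-listing-step10000.txt

wc -l pd-file-listing-step10000.txt
```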

Now run the code:

./RUN.bash pd-ef-json-filelist.txt
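RUN.bash takes a plain-text file list as its argument, presumably one JSON file per line. A sketch of what such a list might look like, using made-up placeholder paths rather than real HTRC volume files:

```shell
# Hypothetical example of the kind of one-path-per-line file list RUN.bash consumes.
# These paths are placeholders, not real HTRC volumes.
cat > pd-ef-json-filelist.txt <<'EOF'
json-files/dummy-volume-00001.json.bz2
json-files/dummy-volume-10001.json.bz2
json-files/dummy-volume-20001.json.bz2
EOF

wc -l pd-ef-json-filelist.txt
```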

Check the Hadoop and Spark daemons are running:

% jps
19468 SecondaryNameNode
19604 Master
19676 Jps
19212 NameNode

Set up a user directory in HDFS, and copy in the file listing:

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/htrc
hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
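The layout the commands above build in HDFS can be mimicked on the local filesystem, purely as an illustration of the resulting structure; the directory and file names below come from the commands above, but mkdir/cp here operate locally rather than through 'hdfs dfs':

```shell
# Local-filesystem illustration of the /user/htrc layout created in HDFS above.
mkdir -p user/htrc
echo "dummy-volume-00001.json.bz2" > pd-file-listing-step10000.txt
cp pd-file-listing-step10000.txt user/htrc/.

ls user/htrc
```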