----
Introduction
----

Java code for processing HTRC Extracted Feature JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

----
Setup Procedure
----

This is Step 2 of a two-step setup procedure.

For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

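The assumptions above can be verified with a short shell snippet. The
'check_path' helper below is purely a hypothetical convenience, not part of
this project:

```shell
# Hypothetical helper (not part of this project): report whether each
# named tool can be found on the PATH.
check_path() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "MISSING: $tool"
    fi
  done
}

# On a correctly provisioned machine, both lines should read "ok: ...".
check_path svn mvn
```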
----
Step 2
----

1. Check that the HDFS and Spark Java daemon processes are running:

     jps

   Example output:

     19212 NameNode
     19468 SecondaryNameNode
     19604 Master
     19676 Jps

   [[
   Starting these processes was previously covered in Step 1, but in brief,
   after formatting the disk with:

     hdfs namenode -format

   the daemons are started with:

     start-dfs.sh
     spark-start-all.sh

   The latter is an alias defined by the Step 1 provisioning (created to
   avoid the conflict over 'start-all.sh', which both Hadoop and
   Spark define).
   ]]

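If you prefer to script this check, the sketch below greps jps-style output
for the daemon names listed above. 'check_daemons' is a hypothetical helper,
not part of this project; on the cluster you would pass it "$(jps)":

```shell
# Hypothetical helper (not part of this project): verify that each
# expected daemon name appears in the supplied jps output.
check_daemons() {
  local output="$1"
  local daemon
  for daemon in NameNode SecondaryNameNode Master; do
    if ! printf '%s\n' "$output" | grep -q "$daemon"; then
      echo "MISSING: $daemon"
      return 1
    fi
  done
  echo "all daemons running"
}

# On the cluster:  check_daemons "$(jps)"
```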
2. Acquire some JSON files to process, if you have not already done so.
   For example:

     ./scripts/PD-GET-FULL-FILE-LIST.sh
     ./scripts/PD-SELECT-EVERY-10000.sh
     ./scripts/PD-DOWNLOAD-EVERY-10000.sh

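For reference, selecting every Nth entry from a line-oriented listing is a
one-line awk operation. The demo below uses a step of 3 on a generated file,
purely as a stand-in for the kind of thing a script like
PD-SELECT-EVERY-10000.sh might do internally (the actual script may be
implemented quite differently):

```shell
# Demo only: generate a 10-line stand-in listing, then keep every 3rd
# line starting from the first. The real script presumably uses a step
# of 10000 over the full file listing.
seq 1 10 > /tmp/demo-listing.txt
awk 'NR % 3 == 1' /tmp/demo-listing.txt
# Prints 1, 4, 7, 10 (one per line)
```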
3. Push these files over to HDFS:

     hdfs dfs -mkdir /user
     hdfs dfs -mkdir /user/htrc

     hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
     hdfs dfs -put pd-ef-json-files /user/htrc/.

4. Compile the code:

     ./COMPILE.bash

   The first time this is run, a variety of Maven/Java dependencies will be
   downloaded.

5. Run the code on the cluster:

     ./RUN.bash pd-ef-json-filelist-10000.txt