---- Introduction ----

Java code for processing HTRC Extracted Features JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

---- Setup Procedure ----

This is Step 2 of a two-step setup procedure. For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

---- Step 2 ----

Compile the code:

  ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.

Next, acquire some JSON files to process. For example:

  ./scripts/PD-GET-FULL-FILE-LIST.sh
  ./scripts/PD-SELECT-EVERY-10000.sh
  ./scripts/PD-DOWNLOAD-EVERY-10000.sh

Now run the code:

  ./RUN.bash pd-ef-json-filelist.txt

You can confirm that the Spark and Hadoop daemons are up with 'jps':

  % jps
  19468 SecondaryNameNode
  19604 Master
  19676 Jps
  19212 NameNode

To make the file listing available to the cluster, put it into HDFS:

  hdfs dfs -mkdir /user
  hdfs dfs -mkdir /user/htrc
  hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
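For orientation, the Spark job that RUN.bash launches presumably reads the
file listing as an RDD of filenames and processes each JSON file in parallel.
Below is a minimal, hypothetical sketch of such a driver; the class and method
names (ProcessJsonFilelist, solrify) are invented for illustration and are not
necessarily what this repository uses:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class ProcessJsonFilelist {
      public static void main(String[] args) {
          // args[0] is the file listing passed to RUN.bash,
          // one JSON filename per line
          SparkConf conf = new SparkConf().setAppName("HTRC EF JSON processing");
          JavaSparkContext jsc = new JavaSparkContext(conf);

          // Each worker node gets a share of the filenames to process
          JavaRDD<String> json_filenames = jsc.textFile(args[0]);

          // Hypothetical per-file step: turn each EF JSON file into a
          // Solr-ingestable form (see the parsing sketch below)
          JavaRDD<String> solr_docs = json_filenames.map(filename -> solrify(filename));

          // count() forces evaluation of the lazy RDD pipeline
          System.out.println("Processed " + solr_docs.count() + " JSON files");
          jsc.close();
      }

      // Placeholder for the real parse-and-convert logic
      static String solrify(String json_filename) { return json_filename; }
  }

With this shape the driver never holds all the files in memory: the listing is
partitioned across the cluster and each file is handled where its partition
lives.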
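An Extracted Features file is a JSON document (distributed bz2-compressed)
holding a volume's metadata and per-page features. As a rough sketch of
reading one, assuming Jackson and Apache Commons Compress are on the classpath
(the repository's actual JSON handling may differ), with field names taken
from the public Extracted Features schema:

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

  import java.io.BufferedInputStream;
  import java.io.FileInputStream;
  import java.io.InputStream;

  public class EFJsonPeek {
      public static void main(String[] args) throws Exception {
          // args[0] is an EF file, e.g. some-volume.json.bz2 (name hypothetical)
          InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
          if (args[0].endsWith(".bz2")) {
              is = new BZip2CompressorInputStream(is);
          }

          JsonNode root = new ObjectMapper().readTree(is);

          // Top-level fields per the public EF schema
          String volume_id = root.path("id").asText();
          int page_count = root.path("features").path("pageCount").asInt();
          System.out.println(volume_id + ": " + page_count + " pages");

          // Per-page features live under features.pages
          for (JsonNode page : root.path("features").path("pages")) {
              System.out.println("  page " + page.path("seq").asText()
                  + ": " + page.path("tokenCount").asInt() + " tokens");
          }
      }
  }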