---- Introduction ----

Java code for processing HTRC Extracted Features JSON files, suitable for
ingesting into Solr. Designed to be used on a Spark cluster with HDFS.

---- Setup Procedure ----

This is Step 2 of a two-step setup procedure. For Step 1, see:

  http://svn.greenstone.org/other-projects/hathitrust/vagrant-spark-hdfs-cluster/trunk/README.txt

*Assumptions*

* You have 'svn' and 'mvn' on your PATH

---- Step 2 ----

Compile the code:

  ./COMPILE.bash

The first time this is run, a variety of Maven/Java dependencies will be
downloaded.

Next, acquire some JSON files to process. For example:

  ./scripts/PD-GET-FULL-FILE-LIST.sh
  ./scripts/PD-SELECT-EVERY-10000.sh
  ./scripts/PD-DOWNLOAD-EVERY-10000.sh

Now run the code:

  ./RUN.bash pd-ef-json-filelist.txt

You can confirm that the Spark and Hadoop daemons are up with 'jps':

  % jps
  19468 SecondaryNameNode
  19604 Master
  19676 Jps
  19212 NameNode

To make the file listing available to the cluster, put it into HDFS:

  hdfs dfs -mkdir /user
  hdfs dfs -mkdir /user/htrc
  hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
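For orientation, the Spark job that RUN.bash launches presumably reads the
file listing as an RDD of filenames and processes each JSON file in parallel.
Below is a minimal, hypothetical sketch of such a driver; the class and method
names (ProcessJsonFilelist, solrify) are invented for illustration and are not
necessarily what this repository uses:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class ProcessJsonFilelist {
      public static void main(String[] args) {
          // args[0] is the file listing passed to RUN.bash,
          // one JSON filename per line
          SparkConf conf = new SparkConf().setAppName("HTRC EF JSON processing");
          JavaSparkContext jsc = new JavaSparkContext(conf);

          // Each worker node gets a share of the filenames to process
          JavaRDD<String> json_filenames = jsc.textFile(args[0]);

          // Hypothetical per-file step: turn each EF JSON file into a
          // Solr-ingestable form (see the parsing sketch below)
          JavaRDD<String> solr_docs = json_filenames.map(filename -> solrify(filename));

          // count() forces evaluation of the lazy RDD pipeline
          System.out.println("Processed " + solr_docs.count() + " JSON files");
          jsc.close();
      }

      // Placeholder for the real parse-and-convert logic
      static String solrify(String json_filename) { return json_filename; }
  }

With this shape the driver never holds all the files in memory: the listing is
partitioned across the cluster and each file is handled where its partition
lives.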
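An Extracted Features file is a JSON document (distributed bz2-compressed)
holding a volume's metadata and per-page features. As a rough sketch of
reading one, assuming Jackson and Apache Commons Compress are on the classpath
(the repository's actual JSON handling may differ), with field names taken
from the public Extracted Features schema:

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

  import java.io.BufferedInputStream;
  import java.io.FileInputStream;
  import java.io.InputStream;

  public class EFJsonPeek {
      public static void main(String[] args) throws Exception {
          // args[0] is an EF file, e.g. some-volume.json.bz2 (name hypothetical)
          InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
          if (args[0].endsWith(".bz2")) {
              is = new BZip2CompressorInputStream(is);
          }

          JsonNode root = new ObjectMapper().readTree(is);

          // Top-level fields per the public EF schema
          String volume_id = root.path("id").asText();
          int page_count = root.path("features").path("pageCount").asInt();
          System.out.println(volume_id + ": " + page_count + " pages");

          // Per-page features live under features.pages
          for (JsonNode page : root.path("features").path("pages")) {
              System.out.println("  page " + page.path("seq").asText()
                  + ": " + page.path("tokenCount").asInt() + " tokens");
          }
      }
  }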