Changeset 30925
- Timestamp: 2016-10-25T23:49:36+13:00
- File: 1 edited
other-projects/hathitrust/solr-extracted-features/trunk/README.txt
--- README.txt (r30922)
+++ README.txt (r30925)
@@ -25,38 +25,59 @@
 ----
 
+1. Check HDFS and Spark Java daemon processes are running:
 
-Compile the code:
+  jps
 
-  ./COMPILE.bash
+Example output:
+
+  19212 NameNode
+  19468 SecondaryNameNode
+  19604 Master
+  19676 Jps
+
+[[
+  Starting these processes was previously covered in Step 1, but in brief,
+  after formatting the disk with:
+
+    hdfs namenode -format
+
+  The daemons are started with:
+
+    start-dfs.sh
+    spark-start-all.sh
+
+  The latter is an alias defined by Step 1 provisioning (created to
+  avoid the conflict over 'start-all.sh', which both Hadoop and
+  Spark define)
+]]
+
+2. Acquire some JSON files to process, if not already done so.
+   For example:
+
+  ./scripts/PD-GET-FULL-FILE-LIST.sh
+  ./scripts/PD-SELECT-EVERY-10000.sh
+  ./scripts/PD-DOWNLOAD-EVERY-10000.sh
+
+3. Push these files over to HDFS
+
+  hdfs dfs -mkdir /user
+  hdfs dfs -mkdir /user/htrc
+
+  hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
+  hdfs dfs -put pd-ef-json-files /user/htrc/.
+
+4. Compile the code:
+
+  ./COMPILE.bash
 
 The first time this is run, a variety of Maven/Java dependencies will be
 downloaded.
 
+5. Run the code on the cluster:
 
-Next acquire some JSON files to procesds. For example:
-
-  ./scripts/PD-GET-FULL-FILE-LIST.sh
-  ./scripts/PD-SELECT-EVERY-10000.sh
-  ./scripts/PD-DOWNLOAD-EVERY-10000.sh
-
-Now run the code:
-  ./RUN.bash pd-ef-json-filelist.txt
+  ./RUN.bash pd-ef-json-filelist-10000.txt
 
 
-% jps
-19468 SecondaryNameNode
-19604 Master
-19676 Jps
-19212 NameNode
 
 
-hdfs -mkdir /user
-46 hdfs dfs -mkdir /user
-47 hdfs dfs -mkdir /user/htrc
-48 hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
 
-
-hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
-hdfs dfs -put pd-ef-json-files /user/htrc/.
-
-
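Step 1 of the revised README asks the reader to compare the `jps` listing against the example output by eye. As a rough sketch of how that check might be scripted, the function below reports which of the three daemons from the example output are absent. The function name `missing_daemons` and the idea of scripting the check are illustrative assumptions, not part of the changeset; only the daemon names are taken from the README.

```shell
# Sketch only: report which expected daemons are absent from `jps` output.
# Daemon names come from the README's example output; the function name
# is hypothetical and not defined anywhere in the repository.

missing_daemons() {
  # Reads `jps` output on stdin; prints one missing daemon name per line.
  listing=$(cat)
  for daemon in NameNode SecondaryNameNode Master; do
    # Compare against the whole second field so that "SecondaryNameNode"
    # does not satisfy the check for "NameNode".
    if ! printf '%s\n' "$listing" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

# Typical use (needs a JDK's jps on PATH):
#   jps | missing_daemons
```

An empty result means all three daemons were listed. Any name printed points back to the bracketed start-up note in Step 1 (`start-dfs.sh` for the HDFS daemons, `spark-start-all.sh` for the Spark master).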