Changeset 30925

Timestamp: 25.10.2016 23:49:36 (3 years ago)
Author: davidb
Message:

Improved instructions

Files:
1 modified

  • other-projects/hathitrust/solr-extracted-features/trunk/README.txt

--- other-projects/hathitrust/solr-extracted-features/trunk/README.txt (r30922)
+++ other-projects/hathitrust/solr-extracted-features/trunk/README.txt (r30925)
@@ -25,38 +25,59 @@
 ----
 
+1. Check HDFS and Spark Java daemon processes are running:
 
-Compile the code:
+    jps
 
-  ./COMPILE.bash
+Example output:
+
+    19212 NameNode
+    19468 SecondaryNameNode
+    19604 Master
+    19676 Jps
+
+[[
+  Starting these processes was previously covered in Step 1, but in brief,
+  after formatting the disk with:
+
+    hdfs namenode -format
+
+  The daemons are started with:
+
+    start-dfs.sh
+    spark-start-all.sh
+
+  The latter is an alias defined by Step 1 provisioning (created to
+  avoid the conflict over 'start-all.sh', which both Hadoop and
+  Spark define)
+]]
+
+2. Acquire some JSON files to process, if not already done so.
+   For example:
+
+    ./scripts/PD-GET-FULL-FILE-LIST.sh
+    ./scripts/PD-SELECT-EVERY-10000.sh
+    ./scripts/PD-DOWNLOAD-EVERY-10000.sh
+
+3. Push these files over to HDFS
+
+    hdfs dfs -mkdir /user
+    hdfs dfs -mkdir /user/htrc
+
+    hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
+    hdfs dfs -put pd-ef-json-files /user/htrc/.
+
+4. Compile the code:
+
+    ./COMPILE.bash
 
 The first time this is run, a variety of Maven/Java dependencies will be
 downloaded.
 
+5. Run the code on the cluster:
 
-Next acquire some JSON files to procesds.  For example:
-
-  ./scripts/PD-GET-FULL-FILE-LIST.sh
-  ./scripts/PD-SELECT-EVERY-10000.sh
-  ./scripts/PD-DOWNLOAD-EVERY-10000.sh
-
-Now run the code:
-  ./RUN.bash pd-ef-json-filelist.txt
+  ./RUN.bash pd-ef-json-filelist-10000.txt
 
 
-% jps
-    19468 SecondaryNameNode
-    19604 Master
-    19676 Jps
-    19212 NameNode
 
 
- hdfs -mkdir /user
-   46  hdfs dfs -mkdir /user
-   47  hdfs dfs -mkdir /user/htrc
-   48  hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
 
-
- hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
- hdfs dfs -put pd-ef-json-files /user/htrc/.
-
-
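
Step 1 of the revised README asks the reader to compare the `jps` listing against the example output by eye. As a minimal sketch of how that check might be scripted, the function below reads `jps` output on stdin and prints any expected daemon that is absent. The function name `missing_daemons` is hypothetical and not part of the repository; the three daemon names are taken from the example output in the README, on the assumption that a working setup shows all of them.

```shell
# Sketch only: report which expected daemons are absent from `jps` output.
# `missing_daemons` is a hypothetical helper, not part of this repository.
# Expected names come from the README's example output (assumption: all
# three must be present for the later steps to work).

missing_daemons() {
    # `jps` prints "<pid> <name>" per line; keep only the name column.
    running=$(awk '{print $2}')
    for name in NameNode SecondaryNameNode Master; do
        # -x matches whole lines, so "NameNode" does not match
        # "SecondaryNameNode"; -F treats the name as a literal string.
        if ! printf '%s\n' "$running" | grep -qxF "$name"; then
            echo "$name"
        fi
    done
}

# Usage (assumes `jps` is on PATH):
#   jps | missing_daemons
```

Empty output would suggest all three daemons are up; any name printed is one to start (see the bracketed note in the README) before pushing files to HDFS.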