Timestamp:
2016-10-25T23:49:36+13:00
Author:
davidb
Message:

Improved instructions

File:
1 edited

  • other-projects/hathitrust/solr-extracted-features/trunk/README.txt

    --- other-projects/hathitrust/solr-extracted-features/trunk/README.txt (r30922)
    +++ other-projects/hathitrust/solr-extracted-features/trunk/README.txt (r30925)
    @@ -25,38 +25,59 @@
     ----
     
    +1. Check HDFS and Spark Java daemon processes are running:
     
    -Compile the code:
    +    jps
     
    -  ./COMPILE.bash
    +Example output:
    +
    +    19212 NameNode
    +    19468 SecondaryNameNode
    +    19604 Master
    +    19676 Jps
    +
    +[[
    +  Starting these processes was previously covered in Step 1, but in brief,
    +  after formatting the disk with:
    +
    +    hdfs namenode -format
    +
    +  The daemons are started with:
    + 
    +    start-dfs.sh
    +    spark-start-all.sh
    +
    +  The latter is an alias defined by Step 1 provisioning (created to
    +  avoid the conflict over 'start-all.sh', which both Hadoop and
    +  Spark define)
    +]]
    +
    +2. Acquire some JSON files to process, if not already done so.
    +   For example:
    +
    +    ./scripts/PD-GET-FULL-FILE-LIST.sh
    +    ./scripts/PD-SELECT-EVERY-10000.sh
    +    ./scripts/PD-DOWNLOAD-EVERY-10000.sh
    +
    +3. Push these files over to HDFS
    +
    +    hdfs dfs -mkdir /user
    +    hdfs dfs -mkdir /user/htrc
    +
    +    hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
    +    hdfs dfs -put pd-ef-json-files /user/htrc/.
    +
    +4. Compile the code:
    +
    +    ./COMPILE.bash
     
     The first time this is run, a variety of Maven/Java dependencies will be
     downloaded.
     
    +5. Run the code on the cluster:
     
    -Next acquire some JSON files to procesds.  For example:
    -
    -  ./scripts/PD-GET-FULL-FILE-LIST.sh
    -  ./scripts/PD-SELECT-EVERY-10000.sh
    -  ./scripts/PD-DOWNLOAD-EVERY-10000.sh
    -
    -Now run the code:
    -  ./RUN.bash pd-ef-json-filelist.txt
    +  ./RUN.bash pd-ef-json-filelist-10000.txt
     
     
    -% jps
    -    19468 SecondaryNameNode
    -    19604 Master
    -    19676 Jps
    -    19212 NameNode
     
     
    - hdfs -mkdir /user
    -   46  hdfs dfs -mkdir /user
    -   47  hdfs dfs -mkdir /user/htrc
    -   48  hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
     
    -
    - hdfs dfs -put pd-file-listing-step10000.txt /user/htrc/.
    - hdfs dfs -put pd-ef-json-files /user/htrc/.
    -
    -
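
Step 1 of the revised README (checking that the HDFS and Spark daemons are running) could be scripted roughly as follows. This is a minimal sketch, not part of the changeset: the function name `missing_daemons` is hypothetical, and it assumes the required daemons are exactly the three named in the README's example `jps` output (NameNode, SecondaryNameNode, Master).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: reads `jps` output on
# stdin and prints the name of each required daemon that is absent.
# Assumption: the required daemons are the ones shown in the README's
# example output.
missing_daemons() {
    local jps_output daemon
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode Master; do
        # Compare against the whole second field so that "NameNode"
        # does not match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}
```

Typical use would be `jps | missing_daemons`, which prints nothing when all three daemons are running and one name per line otherwise.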
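
Step 3 of the revised README (pushing the file listing and JSON files to HDFS) might be wrapped in a function like the one below. This is a sketch under stated assumptions: `push_to_hdfs` and the `HDFS_CMD` override are hypothetical names, and it swaps the README's two separate `-mkdir` calls for a single `hdfs dfs -mkdir -p`, which also creates the parent directory and tolerates directories that already exist.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around Step 3 of the revised README; not part of
# the repository. HDFS_CMD is an assumed override hook so the commands can
# be previewed (for example HDFS_CMD=echo) without touching a cluster.
push_to_hdfs() {
    local hdfs="${HDFS_CMD:-hdfs}"
    # -p creates /user and /user/htrc in one call and does not fail if
    # they already exist (the README issues two plain -mkdir calls).
    "$hdfs" dfs -mkdir -p /user/htrc || return 1
    # File and directory names are the ones used in the README.
    "$hdfs" dfs -put pd-file-listing-step10000.txt /user/htrc/. || return 1
    "$hdfs" dfs -put pd-ef-json-files /user/htrc/. || return 1
}
```

Running `HDFS_CMD=echo push_to_hdfs` prints the three `dfs` commands instead of executing them, which is a cheap way to check the paths before a real run.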