Changeset 33524


Timestamp: 2019-09-26T20:34:12+12:00
Author: ak19
Message: 1. Further adjustments to documenting what we did to get things to run on the hadoop filesystem. 2. All the hadoop related gitprojects (with patches), separate copy of patches, config modifications and missing jar files that we needed, scripts we created to run on the hdfs machine and its host machine.
Location: gs3-extensions/maori-lang-detection/hdfs-instructions
Files: 23 added, 1 edited
  • gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt

--- gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt (r33514)
+++ gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt (r33524)
@@ -213,8 +213,21 @@
 
        git clone https://github.com/commoncrawl/cc-index-table.git
-       cd cc-index-table
-       mvn package
-
-2. Although cc-index-table will compile successfully, it will throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
+
+2. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below:
+
+17c17,18
+<       <spark.version>2.4.1</spark.version>
+---
+>       <!--<spark.version>2.4.1</spark.version>-->
+>       <spark.version>2.3.0</spark.version>
+135a137,143
+>       <dependency>
+>         <groupId>org.apache.hadoop</groupId>
+>         <artifactId>hadoop-aws</artifactId>
+>         <version>2.7.6</version>
+>       </dependency>
+>
+
+3. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
 
 a. Set option(header) to false, since the csv file contains no header row, only data rows.
     
@@ -234,8 +247,10 @@
                                 .toJavaRDD();
 
-3. Now recompile cc-index-table with the above modifications:
-
-   (cd cc-index-table)
-    mvn package
+// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
+
+4. Now (re)compile cc-index-table with the above modifications:
+
+   cd cc-index-table
+   mvn package
 
 -------------------------------
     
@@ -254,5 +269,5 @@
 
 
-2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener:
+2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json):
 
    <dependency>
     
@@ -261,4 +276,37 @@
      <version>20131018</version>
    </dependency>
+
+[
+  UNFAMILAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
+  a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
+  ia-hadoop-tools>diff pom.xml.orig pom.xml
+
+  <       <groupId>org.netpreserve.commons</groupId>
+  <       <artifactId>webarchive-commons</artifactId>
+  <       <version>1.1.1-SNAPSHOT</version>
+  ---
+  >       <groupId>org.commoncrawl</groupId>
+  >       <artifactId>ia-web-commons</artifactId>
+  >       <version>1.1.9-SNAPSHOT</version>
+
+  b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
+
+  However, a diff -rq between the latest "ia-hadoop-tools" gitproject checked out a month after the "ia-hadoop-tools.orig" checkout I ran, shows the following differences in files which are not shown as recently modified in github itself in that same period.
+
+  ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
+  Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
+  Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
+]
 
 3. Now can compile ia-hadoop-tools:
     
@@ -298,5 +346,5 @@
 
 OUTPUT:
-After hours of processing, you should end up with:
+After hours of processing (leave it to run overnight), you should end up with:
       hdfs dfs -ls /user/vagrant/<crawl-timestamp>
 In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
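
Illustration (not part of the changeset): the spark.version edit that the Readme describes for the cc-index-table pom.xml could also be scripted instead of made by hand. The following is a minimal sketch under stated assumptions: it works on a scratch file created with mktemp as a hypothetical stand-in for pom.xml, it assumes the property line looks exactly as shown in the Readme's "17c17,18" excerpt, and it does not add the hadoop-aws dependency, which would still need to be inserted manually.

```shell
#!/bin/sh
# Sketch only: edits a scratch file, not the real cc-index-table/pom.xml.
set -eu

demo=$(mktemp)   # hypothetical stand-in for pom.xml

# A small excerpt shaped like the property shown in the Readme's diff.
cat > "$demo" <<'EOF'
<properties>
      <spark.version>2.4.1</spark.version>
</properties>
EOF

# Comment out the 2.4.1 property and add a 2.3.0 property on the next line,
# keeping the original indentation (captured as \1).
sed 's|^\( *\)<spark.version>2\.4\.1</spark.version>|\1<!--<spark.version>2.4.1</spark.version>-->\
\1<spark.version>2.3.0</spark.version>|' "$demo" > "$demo.new"

cat "$demo.new"
```

To try this against a real checkout, one would point the sed command at a copy of cc-index-table/pom.xml and compare the result with diff before replacing the original.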