Changeset 33524

Timestamp:
26.09.2019 20:34:12 (3 weeks ago)
Author:
ak19
Message:

1. Further adjustments to documenting what we did to get things to run on the hadoop filesystem. 2. Added all the hadoop-related git projects (with patches), a separate copy of the patches, the config modifications and missing jar files that we needed, and the scripts we created to run on the hdfs machine and its host machine.

Location:
gs3-extensions/maori-lang-detection/hdfs-instructions
Files:
23 added
1 modified

  • gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt

    r33514 r33524  
    213213 
    214214       git clone https://github.com/commoncrawl/cc-index-table.git 
    215        cd cc-index-table 
    216        mvn package 
    217  
    218 2. Although cc-index-table will compile successfully, it will throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows: 
     215 
     2162. Modify the toplevel pom.xml file used by maven by changing the spark version used to 2.3.0 and adding a dependency for hadoop-aws 2.7.6, as indicated below: 
     217 
     21817c17,18 
     219<       <spark.version>2.4.1</spark.version> 
     220--- 
     221>       <!--<spark.version>2.4.1</spark.version>--> 
     222>       <spark.version>2.3.0</spark.version> 
     223135a137,143 
     224>       <dependency> 
     225>         <groupId>org.apache.hadoop</groupId> 
     226>         <artifactId>hadoop-aws</artifactId> 
     227>         <version>2.7.6</version> 
     228>       </dependency> 
     229>  
     230 
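A quick way to confirm that both pom.xml edits from step 2 are in place before compiling is a small shell check. This is only a minimal sketch and not part of cc-index-table: the function name check_pom is made up for illustration, and it assumes the <version> line directly follows the hadoop-aws <artifactId> line, as in the diff above.

```shell
#!/bin/sh
# Hypothetical helper (not part of cc-index-table): report whether a pom.xml
# carries the two edits described in step 2.
check_pom() {
    pom="$1"
    status=0
    # Look for an active spark.version of 2.3.0, skipping lines that
    # contain an XML comment opener (the old 2.4.1 line is commented out).
    if grep -v '<!--' "$pom" | grep -q '<spark.version>2\.3\.0</spark.version>'; then
        echo "spark.version: ok"
    else
        echo "spark.version: not 2.3.0"
        status=1
    fi
    # Look for the hadoop-aws artifact with version 2.7.6 on the next line.
    # Note: -A1 is supported by GNU and BSD grep but is not in POSIX.
    if grep -A1 '<artifactId>hadoop-aws</artifactId>' "$pom" | grep -q '<version>2\.7\.6</version>'; then
        echo "hadoop-aws: ok"
    else
        echo "hadoop-aws: missing or wrong version"
        status=1
    fi
    return $status
}

# Example call (path assumed relative to where cc-index-table was cloned):
#   check_pom cc-index-table/pom.xml
```

The function returns non-zero if either edit is missing, so it can be chained in front of the build with &&.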
     2313. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows: 
    219232 
    220233a. Set option(header) to false, since the csv file contains no header row, only data rows. 
     
    234247                                .toJavaRDD(); 
    235248 
    236 3. Now recompile cc-index-table with the above modifications: 
    237  
    238    (cd cc-index-table) 
    239     mvn package 
     249// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here. 
     250 
     2514. Now (re)compile cc-index-table with the above modifications: 
     252 
     253   cd cc-index-table 
     254   mvn package 
    240255 
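If the compile step is run from a script rather than typed by hand, it can help to keep the cd inside a subshell so the caller's working directory is left alone. The following is a minimal sketch under that assumption: build_project and the MVN override variable are hypothetical names introduced here for illustration, not something the git projects provide.

```shell
#!/bin/sh
# Hypothetical wrapper: run "mvn package" inside a project directory
# without changing the caller's working directory.
# MVN is an assumed override hook for illustration; it defaults to "mvn".
build_project() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # The parentheses start a subshell, so the cd does not outlive the build.
    ( cd "$dir" && "${MVN:-mvn}" package )
}

# Example call (assumes cc-index-table was cloned into the current directory):
#   build_project cc-index-table
```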
    241256------------------------------- 
     
    254269      
    255270 
    256 2. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener: 
     2712. Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found off https://mvnrepository.com/artifact/org.json/json): 
    257272 
    258273   <dependency> 
     
    261276      <version>20131018</version> 
    262277    </dependency> 
     278 
     279[ 
     280  UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above: 
     281  a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml: 
     282  ia-hadoop-tools>diff pom.xml.orig pom.xml 
     283 
     284  <       <groupId>org.netpreserve.commons</groupId> 
     285  <       <artifactId>webarchive-commons</artifactId> 
     286  <       <version>1.1.1-SNAPSHOT</version> 
     287  --- 
     288  >       <groupId>org.commoncrawl</groupId> 
     289  >       <artifactId>ia-web-commons</artifactId> 
     290  >       <version>1.1.9-SNAPSHOT</version> 
     291 
     292  b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to 
     293 
     294  However, running diff -rq between the latest "ia-hadoop-tools" git project, checked out a month after my "ia-hadoop-tools.orig" checkout, shows differences in the following files, none of which github itself shows as modified in that same period. 
     295 
     296  ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/ 
     297  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ 
     298  Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java 
     299  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ 
     300  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ 
     301  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ 
     302  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ 
     303  Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java 
     304  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ 
     305  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ 
     306  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ 
     307  Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ 
     308  Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ 
     309  Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ 
     310] 
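To make unexpected differences like the ones listed above easier to spot, the diff -rq output for a working checkout and its .orig copy can be boiled down to three counts. This is a rough sketch only: summarise_diff is a hypothetical helper name, and it assumes the English "Files ... differ" and "Only in ..." message wording, which is why the locale is forced to C.

```shell
#!/bin/sh
# Hypothetical helper: summarise "diff -rq" output for two checkouts.
summarise_diff() {
    new="${1%/}"     # strip one trailing slash, if any
    orig="${2%/}"
    changed=0
    only_new=0
    only_orig=0
    # diff exits with status 1 when the trees differ, so ignore its status.
    report=$(LC_ALL=C diff -rq "$new" "$orig" || true)
    while IFS= read -r line; do
        case "$line" in
            # Require ":" or "/" after the directory name so that
            # "ia-hadoop-tools" is not mistaken for "ia-hadoop-tools.orig".
            "Only in $orig:"*|"Only in $orig/"*) only_orig=$((only_orig + 1)) ;;
            "Only in $new:"*|"Only in $new/"*)   only_new=$((only_new + 1)) ;;
            *" differ")                          changed=$((changed + 1)) ;;
        esac
    done <<EOF
$report
EOF
    echo "changed=$changed only_in_new=$only_new only_in_orig=$only_orig"
}

# Example call, using the directory names from the listing above:
#   summarise_diff ia-hadoop-tools ia-hadoop-tools.orig
```

Taking the .orig copy straight after the initial git clone, before any edits, keeps such a comparison limited to local changes.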
    263311 
    2643123. Now compile ia-hadoop-tools: 
     
    298346 
    299347OUTPUT: 
    300 After hours of processing, you should end up with: 
     348After hours of processing (leave it to run overnight), you should end up with: 
    301349      hdfs dfs -ls /user/vagrant/<crawl-timestamp> 
    302350In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/