Changeset 33524

Timestamp:

2019-09-26T20:34:12+12:00

Author:

ak19

Message:

Further adjustments to documenting what we did to get things to run on the hadoop filesystem. 2. All the hadoop related gitprojects (with patches), separate copy of patches, config modifications and missing jar files that we needed, scripts we created to run on the hdfs machine and its host machine.

Location:

gs3-extensions/maori-lang-detection/hdfs-instructions

Files:

23 added
1 edited

Readme.txt (modified) (5 diffs)
conf (added)
conf/ia-hadoop-tools-pom.xml (added)
conf/spark-defaults.conf.in (added)
gitprojects (added)
gitprojects/cc-index-table.tar (added)
gitprojects/ia-hadoop-tools.tar (added)
gitprojects/ia-web-commons.tar (added)
jars (added)
jars/aws-java-sdk-1.11.616.jar (added)
jars/aws-java-sdk-1.7.4.jar (added)
jars/guava.jar (added)
jars/hadoop-aws-2.7.6.jar (added)
patches (added)
patches/CCIndexWarcExport.java (added)
patches/CCIndexWarcExport.java.orig (added)
scripts (added)
scripts/GS_README (added)
scripts/export_maori_index_csv.sh (added)
scripts/export_maori_subset.sh (added)
scripts/export_maori_subset_from_scratch.sh (added)
scripts/get_Maori_WET_records_in_cc_from_Sep2018.sh (added)
scripts/get_maori_WET_records_for_crawl.sh (added)
scripts/limit10_export_index.sh (added)

gs3-extensions/maori-lang-detection/hdfs-instructions/Readme.txt

-              r33514
+              r33524
        git clone https://github.com/commoncrawl/cc-index-table.git
+       cd cc-index-table
+       mvn package
+. Although cc-index-table will compile successfully, it will throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
+. In the top-level pom.xml file used by maven, change the spark version to 2.3.0 and add a dependency on hadoop-aws 2.7.6, as indicated below:
+c17,18
+<       <spark.version>2.4.1</spark.version>
+---
+>       <!--<spark.version>2.4.1</spark.version>-->
+>       <spark.version>2.3.0</spark.version>
+a137,143
+>       <dependency>
+>         <groupId>org.apache.hadoop</groupId>
+>         <artifactId>hadoop-aws</artifactId>
+>         <version>2.7.6</version>
+>       </dependency>
+>
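The spark version change in the diff above could also be applied with a small script rather than by hand. This is only a minimal sketch: the function name `set_spark_version` is hypothetical, it assumes the pom still contains the unmodified `<spark.version>` property, and unlike the manual edit it does not keep the old version as a commented-out line. The hadoop-aws dependency still has to be added separately.

```shell
# Hypothetical helper (not part of cc-index-table): switch the Spark version
# in a pom.xml, keeping a backup copy as pom.xml.orig.
# Usage: set_spark_version path/to/pom.xml 2.3.0
set_spark_version() {
    pom="$1"
    version="$2"
    # Keep the original so the change can be diffed or reverted later.
    cp "$pom" "$pom.orig" || return 1
    # Rewrite only the value of the <spark.version> property.
    sed "s|<spark.version>[^<]*</spark.version>|<spark.version>$version</spark.version>|" \
        "$pom.orig" > "$pom"
}
```

Running `diff pom.xml.orig pom.xml` afterwards should show only the version line as changed.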
+. Although cc-index-table will compile successfully after the above modifications, it will nevertheless throw an exception when it's eventually run. To fix that, edit the file "cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java" as follows:
 a. Set option(header) to false, since the csv file contains no header row, only data rows.
 …
                                 .toJavaRDD();
+. Now recompile cc-index-table with the above modifications:
+   (cd cc-index-table)
+    mvn package
+// TODO: link svn committed versions of orig and modified CCIndexWarcExport.java here.
+. Now (re)compile cc-index-table with the above modifications:
+   cd cc-index-table
+   mvn package
 -------------------------------
 …
 . Add the following into ia-hadoop-tools/pom.xml, in the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener:
+. Add the following into ia-hadoop-tools/pom.xml, at the top of the <dependencies> element, so that maven finds an appropriate version of the org.json package and its JSONTokener (version number found at https://mvnrepository.com/artifact/org.json/json):
    <dependency>
 …
       <version>20131018</version>
     </dependency>
+[
+  UNFAMILIAR CHANGES that I don't recollect making and that may have been a consequence of the change in step 1 above:
+  a. These further differences show up between the original version of the file in pom.xml.orig and the modified new pom.xml:
+  ia-hadoop-tools>diff pom.xml.orig pom.xml
+  <       <groupId>org.netpreserve.commons</groupId>
+  <       <artifactId>webarchive-commons</artifactId>
+  <       <version>1.1.1-SNAPSHOT</version>
+  ---
+  >       <groupId>org.commoncrawl</groupId>
+  >       <artifactId>ia-web-commons</artifactId>
+  >       <version>1.1.9-SNAPSHOT</version>
+  b. I don't recollect changing or manually introducing any java files. I just followed the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
+  However, running diff -rq between the latest "ia-hadoop-tools" gitproject, checked out a month after my "ia-hadoop-tools.orig" checkout, shows the following differences, even though github itself does not show these files as modified in that same period.
+  ia-hadoop-tools> diff -rq ia-hadoop-tools ia-hadoop-tools.orig/
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/io/HDFSTouch.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/io/HDFSTouch.java differ
+  Only in ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs: ArchiveFileExtractor.java
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/CDXGenerator.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/JobDriver.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/JobDriver.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WARCMetadataRecordGenerator.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WATGenerator.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs/WATGenerator.java differ
+  Only in ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/jobs: WEATGenerator.java
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/mapreduce/ZipNumPartitioner.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/pig/ZipNumRecordReader.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java and ia-hadoop-tools.orig/src/main/java/org/archive/hadoop/streaming/ZipNumRecordReader.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/server/FileBackedInputStream.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/FileBackedInputStream.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeClient.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeClient.java differ
+  Files ia-hadoop-tools/src/main/java/org/archive/server/GZRangeServer.java and ia-hadoop-tools.orig/src/main/java/org/archive/server/GZRangeServer.java differ
+]
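To repeat the comparison in (b) and see only the Java sources, something like the following might help. It is a sketch only: the function name `java_diffs` is hypothetical, and it assumes both checkouts sit side by side as in the listing above.

```shell
# Hypothetical helper: list only the Java sources that differ between two checkouts.
# Usage: java_diffs ia-hadoop-tools ia-hadoop-tools.orig
java_diffs() {
    # LC_ALL=C keeps diff's messages in English ("Files ... differ", "Only in ...").
    # diff exits with status 1 when differences exist and grep exits with 1 when
    # nothing matches, so neither status is treated as an error here.
    LC_ALL=C diff -rq "$1" "$2" | grep '\.java' || true
}
```

Lines starting with "Files" are sources present in both checkouts with different content; lines starting with "Only in" are sources present in just one of them.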
 . Now you can compile ia-hadoop-tools:
 …
 OUTPUT:
 After hours of processing, you should end up with:
+After hours of processing (leave it to run overnight), you should end up with:
       hdfs dfs -ls /user/vagrant/<crawl-timestamp>
 In particular, the zipped wet records at hdfs:///user/vagrant/<crawl-timestamp>/wet/
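A quick way to check for the WET output once the job has finished is sketched below. The function names `wet_dir` and `list_wet_records` are hypothetical; the crawl timestamp is whatever value was used for `<crawl-timestamp>` above, and the listing only works on a machine where the `hdfs` command is on the PATH.

```shell
# Hypothetical helper: print the HDFS directory expected to hold the zipped
# WET records for a given crawl timestamp.
wet_dir() {
    printf 'hdfs:///user/vagrant/%s/wet/\n' "$1"
}

# Hypothetical helper: list that directory, if the hdfs client is available.
# Usage: list_wet_records <crawl-timestamp>
list_wet_records() {
    if command -v hdfs >/dev/null 2>&1; then
        hdfs dfs -ls "$(wet_dir "$1")"
    else
        echo "hdfs not found on PATH" >&2
        return 1
    fi
}
```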
