Changeset 33495

Timestamp:

2019-09-22T19:19:36+12:00

Author:

ak19

Message:

Pruned unused commands, added comments, and marked unused variables for removal in a future version of this script, once the full script has been tested on CC crawl 2019-26.

Files:

1 edited

gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh (modified) (7 diffs)

Legend:

(leading space) Unmodified
+ Added
- Removed

gs3-extensions/maori-lang-detection/bin/script/get_maori_WET_records_for_crawl.sh

--- r33494
+++ r33495
 #!/bin/bash
+# convert URL index with Spark on Yarn
+# This script is BASED ON the cc-index-table github project's convert_url_index script at
+# https://github.com/commoncrawl/cc-index-table/blob/master/src/script/convert_url_index.sh
+# That script is described as
+# "A Spark job converts the Common Crawl URL index files (a sharded gzipped index in CDXJ format)
+# into a table in Parquet or ORC format." (https://github.com/commoncrawl/cc-index-table)."
+# If you want to run that script, then modify its variables to have the following values before
+# running it, in order for it to work on our machine for doing analytics:
+# EXECUTOR_MEM=3g
+# EXECUTOR_CORES=2
+# NUM_EXECUTORS=2
+# DRIVER_MEM=3g
+# Since that script was copied here, as a result, a lot of such variables (like executor
+# and memory related) are unused here, as they were just copied directly across. Such unused
+# variables can probably be removed from this file.
+# This script was modified to do the following:
+# SQL query CommonCrawl's distributed cc-index table on Amazon S3 for the parameterised crawl timestamp
+# and get all those records for which the primary language in the content_languages field is MRI for Maori.
+# Only the WARC related fields (url, filename, offset and length fields) of each record are requested.
+# The matching records' fields are then constructed into a distributed csv file on the local hdfs system
+# A second phase then requests the warc files at those offsets and downloads them onto the local hdfs.
+# We still get zipped WARC files, but they only contain the pages of that crawl where the primary language
+# was identified as MRI.
+# A third phase converts those WARC files into WET (and WAT) files and copies these zipped files onto the
+# mounted shared space on vagrant.
+#---------------------------- START UNUSED VARIABLES---------------------------#
 # Table format configuration
 …
 PARTITION_BY="crawl,subset"
+# Spark configuration
+SPARK_HOME="$SPARK_HOME"
+# EXECUTOR_MEM=44g
+# EXECUTOR_CORES=12
+# NUM_EXECUTORS=4
+# DRIVER_MEM=4g
+#--- Dr Bainbridge modified the above variables in the original script, convert_url_index.sh,
+# as follows in order to get that spark job to run. Not used in this script. ---#
+EXECUTOR_MEM=3g
+EXECUTOR_CORES=2
+NUM_EXECUTORS=2
+DRIVER_MEM=3g
+#--- VARIABLES PROBABLY ALSO NOT OF USE IN THIS SCRIPT ---#
+SPARK_ON_YARN="--master yarn"
+SPARK_EXTRA_OPTS=""
+# source specific configuration file
+## test -e $(dirname $0)/convert_url_index_conf.sh && . $(dirname $0)/convert_url_index_conf.sh
+#---------------------------- END UNUSED VARIABLES---------------------------#
+# The crawl timestamp, of the form CC-MAIN-2019-26
+# Obtain from http://index.commoncrawl.org/
+CRAWL_ID=$1
+if [ "x$CRAWL_ID" == "x" ]; then
+    echo "No crawl timestamp provided. Should be of the form CC-MAIN-YYYY-COUNT."
+    echo "e.g. CC-MAIN-2019-26. Choose a crawl timestamp from http://index.commoncrawl.org/"
+    exit
+fi
 # Output directory
-CRAWL_ID=$1
 OUTPUT_PARENTDIR=hdfs:///user/vagrant/${CRAWL_ID}
      # or just OUTPUT_PARENTDIR=/user/vagrant/${CRAWL_ID}, since /user/vagrant is on hdfs:
 …
-# Spark configuration
-SPARK_HOME="$SPARK_HOME"
-# EXECUTOR_MEM=44g
-# EXECUTOR_CORES=12
-# NUM_EXECUTORS=4
-# DRIVER_MEM=4g
-EXECUTOR_MEM=3g
-EXECUTOR_CORES=2
-NUM_EXECUTORS=2
-DRIVER_MEM=3g
-SPARK_ON_YARN="--master yarn"
-SPARK_EXTRA_OPTS=""
-# source specific configuration file
-## test -e $(dirname $0)/convert_url_index_conf.sh && . $(dirname $0)/convert_url_index_conf.sh
 _APPJAR=$PWD/target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
 …
 set -x
-OUTPUTDIR="hdfs:///user/vagrant/${CRAWL_ID}/cc-mri-csv"
+# PHASE 1: querying this crawl's massive index with an SQL query that requests just the references to warc files
+# for those crawled web pages where the content_languages field's primary language is MRI (3 letter code for Maori)
+# The output is a distributed .csv file which will be stored in a "cc-mri-csv" subfolder of the $OUTPUT_PARENTDIR.
+#OUTPUTDIR="hdfs:///user/vagrant/${CRAWL_ID}/cc-mri-csv"
+OUTPUTDIR="${OUTPUT_PARENTDIR}/cc-mri-csv"
 #   --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
 #   --conf spark.hadoop.fs.s3a.secret.key=SECRETKEY \
-#    --conf spark.hadoop.fs.s3a.access.key=[REDACTED] \
-#    --conf spark.hadoop.fs.s3a.secret.key=[REDACTED] \
 …
 if [ $? == 0 ]; then
     echo "Directory cc-mri-unzipped-csv already exists for crawl ${CRAWL_ID}."
+    echo "Assuming cc-mri.csv also exists inside $OUTPUT_PARENTDIR"
 else
     echo "Creating directory $OUTPUT_PARENTDIR/cc-mri-unzipped-csv..."
     hdfs dfs -mkdir $OUTPUT_PARENTDIR/cc-mri-unzipped-csv
-fi
-echo "Unzipping ${OUTPUTDIR}/part files into $OUTPUT_PARENTDIR/cc-mri-unzipped-csv/cc-mri.csv"
-hdfs dfs -cat $OUTPUTDIR/part* | gzip -d | hdfs dfs -put - $OUTPUT_PARENTDIR/cc-mri-unzipped-csv/cc-mri.csv
-# Now onto phase 2, which uses the index of MRI warc URLs and offsets,
+    echo "Unzipping ${OUTPUTDIR}/part files into $OUTPUT_PARENTDIR/cc-mri-unzipped-csv/cc-mri.csv"
+    hdfs dfs -cat $OUTPUTDIR/part* | gzip -d | hdfs dfs -put - $OUTPUT_PARENTDIR/cc-mri-unzipped-csv/cc-mri.csv
+fi
+# PHASE 2, which uses the index of MRI warc URLs and offsets,
 # stored in the now unzipped .csv file,
 # to get all the WARC records it specifies at the specified warc offsets.
 …
 OUTPUTDIR="hdfs:///user/vagrant/${CRAWL_ID}/warc"
-# $SPARK_HOME/bin/spark-submit \
-#     $SPARK_ON_YARN \
-#     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
-#     --conf spark.core.connection.ack.wait.timeout=600s \
-#     --conf spark.network.timeout=120s \
-#     --conf spark.task.maxFailures=20 \
-#     --conf spark.shuffle.io.maxRetries=20 \
-#     --conf spark.shuffle.io.retryWait=60s \
-#     --conf spark.driver.memory=$DRIVER_MEM \
-#     --conf spark.executor.memory=$EXECUTOR_MEM \
-#     $SPARK_EXTRA_OPTS \
-#     --num-executors $NUM_EXECUTORS \
-#     --executor-cores $EXECUTOR_CORES \
-#     --executor-memory $EXECUTOR_MEM \
-#     --conf spark.hadoop.parquet.enable.dictionary=true \
-#     --conf spark.sql.parquet.filterPushdown=true \
-#     --conf spark.sql.parquet.mergeSchema=false \
-#     --conf spark.sql.hive.metastorePartitionPruning=true \
-#     --conf spark.hadoop.parquet.enable.summary-metadata=false \
-#     --class org.commoncrawl.spark.CCIndex2Table $_APPJAR \
-#     --outputCompression=$COMPRS \
-#     --outputFormat=$FORMAT $NESTED \
-#     --partitionBy=$PARTITION_BY \
-#     "$DATA" "$OUTPUTDIR"
 #   --conf spark.hadoop.fs.s3a.access.key=ACCESSKEY \
 …
-# Phase 3: convert warc files to wet files and tar them up into the mounted shared area
+# PHASE 3: convert warc files to wet files and copy the wet files into the mounted shared area
 hdfs dfs -test -f $OUTPUTDIR/_SUCCESS
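
The argument check added in this changeset accepts any non-empty string and calls a bare `exit`, so a missing argument still ends the script with status 0. A minimal sketch of a stricter check follows. It assumes crawl IDs always match the CC-MAIN-YYYY-NN pattern, which may not hold for every older crawl, and `validate_crawl_id` is a hypothetical helper that is not part of the committed script.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of the committed script: checks that a crawl
# ID looks like CC-MAIN-YYYY-NN before any Spark or HDFS work starts.
validate_crawl_id() {
    local crawl_id="$1"

    # Same condition as the committed check: reject a missing argument.
    if [ -z "$crawl_id" ]; then
        echo "No crawl timestamp provided. Should be of the form CC-MAIN-YYYY-COUNT." >&2
        return 1
    fi

    # Assumed pattern: a four-digit year followed by a two-digit count.
    if [[ ! "$crawl_id" =~ ^CC-MAIN-[0-9]{4}-[0-9]{2}$ ]]; then
        echo "Unexpected crawl timestamp '$crawl_id'. e.g. CC-MAIN-2019-26." >&2
        return 1
    fi

    return 0
}
```

In the script this could be called as `validate_crawl_id "$1" || exit 1` before `CRAWL_ID=$1` is used, so that a caller can detect the failure from the exit status.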
