Changeset 33441 for gs3-extensions

Timestamp:

2019-08-28T19:30:00+12:00

Author:

ak19

Message:

Adding further notes to do with running the CC-index examples on spark.

File:

1 edited

gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt (modified) (3 diffs)


gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

-              r33440
+              r33441
 - If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.
+-------------------------
 …
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
-            WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages = LIKE '%mri%'" \
+            WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --numOutputPartitions 12 \
    --numRecordsPerWarcFile 20000 \
 …
    .../my_output_path/
+=========================================================
+Configuring spark to work on Amazon AWS s3a dataset:
+=========================================================
+https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
+http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
+https://answers.dataiku.com/1734/common-crawl-s3
+https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
+https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
+https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
+===========================================
+IAM Role (or user) and commoncrawl profile
+===========================================
+"iam" role or user for commoncrawl(er) profile
+aws management console:
+[email protected]
+lab pwd, capital R and ! (maybe g)
+commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
+<!--
+    <property>
+      <name>fs.s3a.awsAccessKeyId</name>
+      <value>XXX</value>
+    </property>
+    <property>
+      <name>fs.s3a.awsSecretAccessKey</name>
+      <value>XXX</value>
+    </property>
+-->
+But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
+you'll want to put the Amazon AWS access key and secret key in the spark properties file:
+       sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
+The spark properties should contain:
+spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+spark.hadoop.fs.s3a.access.key=ACCESSKEY
+spark.hadoop.fs.s3a.secret.key=SECRETKEY
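A minimal sketch for checking that the three s3a properties listed above are present in a Spark properties file before submitting a job. The helper name check_s3a_conf is hypothetical and not part of Spark. It assumes each property is written as key=value or as key followed by whitespace and the value.

```shell
# check_s3a_conf FILE
# Hypothetical helper: report which of the three s3a properties from the
# notes are missing from FILE. Returns 0 if all are present, 1 otherwise.
check_s3a_conf() {
    conf_file=$1
    missing=0
    for key in spark.hadoop.fs.s3a.impl \
               spark.hadoop.fs.s3a.access.key \
               spark.hadoop.fs.s3a.secret.key; do
        # Escape the dots so they match literally, then require the key at
        # the start of a line, followed by '=' or whitespace.
        pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
        if ! grep -Eq "^[[:space:]]*${pattern}[=[:space:]]" "$conf_file"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be run as: check_s3a_conf /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf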
+-------------
+APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
+$SPARK_HOME/bin/spark-submit \
+    --conf spark.hadoop.parquet.enable.dictionary=true \
+    --conf spark.hadoop.parquet.enable.summary-metadata=false \
+    --conf spark.sql.hive.metastorePartitionPruning=true \
+    --conf spark.sql.parquet.filterPushdown=true \
+    --conf spark.sql.parquet.mergeSchema=true \
+    --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
+        --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
+            FROM ccindex
+            WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
+    --outputFormat csv \
+    --numOutputPartitions 10 \
+    --outputCompression gzip \
+    s3a://commoncrawl/cc-index/table/cc-main/warc/ \
+    hdfs:///user/vagrant/cc-mri-csv
 ----------------
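
The query above selects warc_filename, warc_record_offset and warc_record_length, which together locate one record inside a WARC file. As a rough sketch of how a row of the CSV output might be turned into an HTTP byte range: the helper name warc_range and the example values are hypothetical, and the commented curl line assumes the public https://data.commoncrawl.org/ endpoint rather than anything stated in these notes.

```shell
# warc_range OFFSET LENGTH
# Hypothetical helper: print the inclusive HTTP byte range for one WARC record.
warc_range() {
    offset=$1
    length=$2
    # HTTP ranges are inclusive at both ends, so the last byte is
    # offset + length - 1.
    echo "bytes=${offset}-$((offset + length - 1))"
}

# Example with made-up values (not taken from a real crawl):
warc_range 1000 250    # prints bytes=1000-1249

# Fetching the record might then look like this (assumed endpoint;
# WARC_FILENAME stands in for a value from the warc_filename column):
#   curl -s -H "Range: $(warc_range 1000 250)" \
#        "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip
```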
