Changeset 33441 for gs3-extensions


Timestamp: 2019-08-28T19:30:00+12:00
Author: ak19
Message: Adding further notes to do with running the CC-index examples on spark.
File: 1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33440 r33441
        10     10  - If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.
        11     11
        12
               12  -------------------------
        13     13
        14     14

        50     50     --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
        51     51              FROM ccindex
        52                     WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages = LIKE '%mri%'" \
               52              WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
        53     53     --numOutputPartitions 12 \
        54     54     --numRecordsPerWarcFile 20000 \

        57     57     .../my_output_path/
        58     58
               59
               60  =========================================================
               61  Configuring spark to work on Amazon AWS s3a dataset:
               62  =========================================================
               63  https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
               64  http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
               65  https://answers.dataiku.com/1734/common-crawl-s3
               66  https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
               67  https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
               68
               69  https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
               70
               71  ===========================================
               72  IAM Role (or user) and commoncrawl profile
               73  ===========================================
               74
               75  "iam" role or user for commoncrawl(er) profile
               76
               77
               78  aws management console:
               79  [email protected]
               80  lab pwd, capital R and ! (maybe g)
               81
               82  commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
               83
               84  <!--
               85      <property>
               86        <name>fs.s3a.awsAccessKeyId</name>
               87        <value>XXX</value>
               88      </property>
               89      <property>
               90        <name>fs.s3a.awsSecretAccessKey</name>
               91        <value>XXX</value>
               92      </property>
               93  -->
               94
               95
               96  But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
               97
               98  you'll want to put the Amazon AWS access key and secret key in the spark properties file:
               99
              100         sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
              101
              102
              103  The spark properties should contain:
              104
              105  spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
              106  spark.hadoop.fs.s3a.access.key=ACCESSKEY
              107  spark.hadoop.fs.s3a.secret.key=SECRETKEY
              108
              109
              110  -------------
              111
              112  APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
              113  $SPARK_HOME/bin/spark-submit \
              114      --conf spark.hadoop.parquet.enable.dictionary=true \
              115      --conf spark.hadoop.parquet.enable.summary-metadata=false \
              116      --conf spark.sql.hive.metastorePartitionPruning=true \
              117      --conf spark.sql.parquet.filterPushdown=true \
              118      --conf spark.sql.parquet.mergeSchema=true \
              119      --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
              120          --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
              121              FROM ccindex
              122              WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
              123      --outputFormat csv \
              124      --numOutputPartitions 10 \
              125      --outputCompression gzip \
              126      s3://commoncrawl/cc-index/table/cc-main/warc/ \
              127      hdfs:///user/vagrant/cc-mri-csv
        59    128
        60    129  ----------------
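
The notes added in this changeset put the three s3a properties into spark-defaults.conf by hand. That step could be scripted roughly as follows. This is only a sketch and not part of the changeset: the function name write_s3a_conf and its three arguments are hypothetical, and it assumes the properties file uses key=value lines as the notes show.

```shell
# Hypothetical helper (not from the changeset): write the three s3a
# properties named in the notes into a Spark properties file.
# Usage: write_s3a_conf CONF_FILE ACCESS_KEY SECRET_KEY
write_s3a_conf() {
    conf_file="$1"
    access_key="$2"
    secret_key="$3"

    # Remove earlier copies of these three keys so the function can be re-run
    # without piling up duplicates. Other settings in the file are kept.
    if [ -f "$conf_file" ]; then
        grep -v -E '^spark\.hadoop\.fs\.s3a\.(impl|access\.key|secret\.key)[[:space:]=]' \
            "$conf_file" > "$conf_file.tmp" || true   # grep exits 1 when nothing is left
        mv "$conf_file.tmp" "$conf_file"
    fi

    # Append the properties exactly as listed in the notes.
    {
        echo "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
        echo "spark.hadoop.fs.s3a.access.key=$access_key"
        echo "spark.hadoop.fs.s3a.secret.key=$secret_key"
    } >> "$conf_file"
}
```

It might be called as write_s3a_conf /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY", where the two environment variable names are assumptions. The notes edit that file with sudo, so the call probably needs the same privileges.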