Changeset 33441

Timestamp: 28.08.2019 19:30:00 (3 weeks ago)
Author: ak19
Message:

Adding further notes to do with running the CC-index examples on spark.

Files: 1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

r33440 → r33441

  10   10   - If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.
  11   11
  12
       12   -------------------------
  13   13
  14   14
     
  50   50     --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
  51   51              FROM ccindex
  52                   WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages = LIKE '%mri%'" \
       52              WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
  53   53     --numOutputPartitions 12 \
  54   54     --numRecordsPerWarcFile 20000 \
     
  57   57     .../my_output_path/
  58   58
     59 
       60   =========================================================
       61   Configuring spark to work on Amazon AWS s3a dataset:
       62   =========================================================
       63   https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
       64   http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
       65   https://answers.dataiku.com/1734/common-crawl-s3
       66   https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
       67   https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml
       68
       69   https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w
       70
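An illustrative aside on the first of those links: besides the access/secret keys covered further down, Spark needs the s3a filesystem classes (hadoop-aws and its aws-java-sdk dependency) on its classpath. One hedged way to do that at launch time is via --packages; the 2.7.6 version below is an assumption, picked only to match the /usr/local/hadoop-2.7.6 install used in these notes:

    # sketch only: pulls hadoop-aws (and, transitively, the AWS SDK) from Maven Central at startup
    $SPARK_HOME/bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.6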
       71   ===========================================
       72   IAM Role (or user) and commoncrawl profile
       73   ===========================================
       74
       75   "iam" role or user for commoncrawl(er) profile
       76
       77
       78   aws management console:
       79   davidb@waikato.ac.nz
       80   lab pwd, capital R and ! (maybe g)
       81
       82   commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3
       83
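If that profile ends up as an ordinary named AWS credentials profile (an assumption — the dataiku page may set it up differently), it would sit in ~/.aws/credentials along these lines, with the same XXX placeholders as used below:

    [commoncrawl]
    aws_access_key_id = XXX
    aws_secret_access_key = XXX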
       84   <!--
       85       <property>
       86         <name>fs.s3a.awsAccessKeyId</name>
       87         <value>XXX</value>
       88       </property>
       89       <property>
       90         <name>fs.s3a.awsSecretAccessKey</name>
       91         <value>XXX</value>
       92       </property>
       93   -->
       94
       95
       96   But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml),
       97
       98   you'll want to put the Amazon AWS access key and secret key in the spark properties file:
       99
      100          sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf
      101
      102
      103   The spark properties should contain:
      104
      105   spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
      106   spark.hadoop.fs.s3a.access.key=ACCESSKEY
      107   spark.hadoop.fs.s3a.secret.key=SECRETKEY
      108
      109
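A quick, hedged way to check that Spark is actually picking those properties up is to read just the schema of the cc-index table over s3a from a throwaway spark-shell session (assumes the hadoop-aws jars are on the classpath as in the earlier sketch; partition discovery over the bucket can take a little while):

    # sketch: only reads table metadata; failures here usually mean the s3a jars or keys are not set up yet
    echo 'spark.read.load("s3a://commoncrawl/cc-index/table/cc-main/warc/").printSchema()' | $SPARK_HOME/bin/spark-shell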
      110   -------------
      111
      112   APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
      113   $SPARK_HOME/bin/spark-submit \
      114       --conf spark.hadoop.parquet.enable.dictionary=true \
      115       --conf spark.hadoop.parquet.enable.summary-metadata=false \
      116       --conf spark.sql.hive.metastorePartitionPruning=true \
      117       --conf spark.sql.parquet.filterPushdown=true \
      118       --conf spark.sql.parquet.mergeSchema=true \
      119       --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
      120           --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
      121               FROM ccindex
      122               WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
      123       --outputFormat csv \
      124       --numOutputPartitions 10 \
      125       --outputCompression gzip \
      126       s3://commoncrawl/cc-index/table/cc-main/warc/ \
      127       hdfs:///user/vagrant/cc-mri-csv
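Once a run like the above completes, the gzipped csv parts can be sanity-checked from inside the cluster with the usual HDFS commands (the part file names under the output directory will vary), e.g.:

    hdfs dfs -ls /user/vagrant/cc-mri-csv
    # -text decompresses known codecs such as gzip, so a part file can be previewed directly:
    hdfs dfs -text /user/vagrant/cc-mri-csv/part-00000* | head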
  59  128
  60  129   ----------------