Changeset 33441 for gs3-extensions
Timestamp: 2019-08-28T19:30:00+12:00
Files: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt
r33440 → r33441:

- If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.

-------------------------

…

    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
-            WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages =LIKE '%mri%'" \
+            WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --numOutputPartitions 12 \
    --numRecordsPerWarcFile 20000 \
…
    .../my_output_path/

Added in r33441:

=========================================================
Configuring spark to work on Amazon AWS s3a dataset:
=========================================================
https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
https://answers.dataiku.com/1734/common-crawl-s3
https://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
https://stackoverflow.com/questions/40169610/where-exactly-should-hadoop-tmp-dir-be-set-core-site-xml-or-hdfs-site-xml

https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w

===========================================
IAM Role (or user) and commoncrawl profile
===========================================

"iam" role or user for commoncrawl(er) profile

aws management console:
[email protected]
lab pwd, capital R and ! (maybe g)

commoncrawl profile created while creating the user/role, by following the instructions at: https://answers.dataiku.com/1734/common-crawl-s3

<!--
<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>XXX</value>
</property>
<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>XXX</value>
</property>
-->

But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml),
you'll want to put the Amazon AWS access key and secret key in the spark properties file:

sudo emacs /usr/local/spark-2.3.0-bin-hadoop2.7/conf/spark-defaults.conf

The spark properties should contain:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY

-------------

APPJAR=target/cc-spark-0.2-SNAPSHOT-jar-with-dependencies.jar
$SPARK_HOME/bin/spark-submit \
    --conf spark.hadoop.parquet.enable.dictionary=true \
    --conf spark.hadoop.parquet.enable.summary-metadata=false \
    --conf spark.sql.hive.metastorePartitionPruning=true \
    --conf spark.sql.parquet.filterPushdown=true \
    --conf spark.sql.parquet.mergeSchema=true \
    --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length
             FROM ccindex
             WHERE crawl = 'CC-MAIN-2019-30' AND subset = 'warc' AND content_languages LIKE '%mri%'" \
    --outputFormat csv \
    --numOutputPartitions 10 \
    --outputCompression gzip \
    s3://commoncrawl/cc-index/table/cc-main/warc/ \
    hdfs:///user/vagrant/cc-mri-csv

----------------