Changeset 33446 for gs3-extensions
- Timestamp:
- 2019-08-29T19:12:39+12:00
- Location:
- gs3-extensions/maori-lang-detection
- Files:
- 2 added
- 1 edited
gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt
- If firefox is launched inside the VM (so inside node1), then pages can be accessed off their respective ports at any of localhost|10.211.55.101|node1.

-----------------------------------
VIEW THE MRI-ONLY INDEX GENERATED
-----------------------------------
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | tail -5

(gz archive, binary file)

vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv

# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop
# XXX - first attempt, superseded by the working command further below:
vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv

vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv
Found 1 items
-rw-r--r--   1 vagrant supergroup   71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv

# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | tail -5

# url, warc_filename, warc_record_offset, warc_record_length
http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404
https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444
http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311
https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408
https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321

# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command
vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
345625
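Since gzip can decompress a concatenation of gz members, catting all the part* files through a single gzip -d as above is safe. As a quick sanity check (my own addition, not part of the original session), the line counts before and after the -put into HDFS can be compared:

# count lines in the decompressed stream straight off the gz part files
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | wc -l
# count lines in the copy now sitting in HDFS
hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
# Both should print the same figure (345625 here) if nothing was lost in transit.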
-----------------------------------------
Running export_mri_subset.sh
-----------------------------------------

The export_mri_subset.sh script is set up to run on the csv input file produced by running export_mri_index_csv.sh

Running this initially produced the following exception:

2019-08-29 05:48:52 INFO CCIndexExport:152 - Number of records/rows matched by query: 345624
2019-08-29 05:48:52 INFO CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file)
2019-08-29 05:48:52 INFO CCIndexExport:165 - Repartitioning data to 70 output partitions
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];;
'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length]
+- AnalysisBarrier
   +- Repartition 70, true
      +- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv

    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169)
    at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2019-08-29 05:48:52 INFO SparkContext:54 - Invoking stop() from shutdown hook


Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
The actual solution is to edit CCIndexWarcExport.java as follows:
1. Set option("header") to false, since the csv file contains no header row, only data rows.
2. The 4 column names are then inferred as _c0 to _c3, not as url/warc_filename etc., so select those instead.

emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java

Change:
    sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
            .load(csvQueryResult);
to:
    sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
            .load(csvQueryResult);

And comment out:
    //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
    //        .toJavaRDD();
replacing it with the default inferred column names:
    JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
            .toJavaRDD();

Now recompile:
    mvn package

And run:
    ./src/script/export_mri_subset.sh
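As an aside, a quick way (my own check, not part of the original session) to confirm the diagnosis that the csv has no header row, before making the java edit:

hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | head -1
# This prints a data row (a URL plus warc fields) rather than a "url,warc_filename,..."
# header line, which is why option("header", false) and the inferred _c0.._c3 names are needed.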
-------------------------


WET example from https://github.com/commoncrawl/cc-warc-examples

…

In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."

----
CMDS
----
https://stackoverflow.com/questions/29565716/spark-kill-running-application
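Presumably (untested in this VM) the advice at that link boils down to the standard YARN commands; the application id below is a placeholder:

yarn application -list                                  # find the Id of the running Spark application
yarn application -kill application_XXXXXXXXXXXXX_XXXX   # then kill it by Id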
=========================================================
Configuring spark to work on Amazon AWS s3a dataset:
=========================================================

…

[If accesskey and secret were specified in hadoop core-site.xml and not in the spark conf props file, then running export_maori_index_csv.sh produced the following error:

2019-08-29 06:16:38 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2019-08-29 06:16:40 WARN FileStreamSink:66 - Error while looking for metadata directory.
Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
]

Instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml),
you'll want to put the Amazon AWS access key and secret key in the spark properties file:

…

The spark properties conf file above should contain:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY
spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY


When the job is running, the Spark context web UI can be visited at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me the first time, since I forwarded the vagrant VM's ports at +1; on subsequent runs, however, it appeared at node1:4040/jobs).

-------------
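A further note of my own (an assumption, not recorded in the session above): spark-submit only reads $SPARK_HOME/conf/spark-defaults.conf automatically; a properties file kept at any other path has to be named explicitly, and then it replaces spark-defaults.conf rather than being merged with it:

# /path/to/my-spark.conf is a hypothetical placeholder for whichever conf file holds the s3a keys
spark-submit --properties-file /path/to/my-spark.conf --class org.commoncrawl.spark.examples.CCIndexExport ...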