Timestamp:
2019-08-29T19:12:39+12:00 (5 years ago)
Author:
ak19
Message:
  1. Committing a working version of export_maori_subset.sh, which takes the csv file produced by running export_maori_index_csv.sh as input and gets the WARC files at the specified offsets.
  2. Notes on the changes necessary to the Java code (cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java) to get export_maori_subset.sh to run without exceptions so far.
  3. The otherwise untested export_maori_subset_from_scratch.sh script, which would perform the sql query and feed the result straight into retrieving the WARC records, instead of producing an intermediate csv file.
File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33443 r33446  
    1010- If firefox is launched inside the VM (so inside node1), then can access pages off their respective ports at any of localhost|10.211.55.101|node1.
    1111
     12-----------------------------------
     13 VIEW THE MRI-ONLY INDEX GENERATED
     14-----------------------------------
     15hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | tail -5
     16
     17(gz archive, binary file)
     18
     19vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv
     20
     21# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop
     22XXX vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv
     23
     24
     25vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
     26vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv
     27Found 1 items
     28-rw-r--r--   1 vagrant supergroup   71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv
     29
     30# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop
     31vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | tail -5
     32
     33# url, warc_filename, warc_record_offset, warc_record_length
     34http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404
     35https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444
     36http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311
     37https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408
     38https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321
     39
     40# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command
     41vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
     42345625
     43
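As an aside, the manual gunzip-and-re-put step above is only needed for command-line viewing: Spark's csv reader decompresses .gz part files on the fly. A minimal, untested Java sketch (the class name is made up here; the HDFS path is the one used above) that loads the compressed parts directly and cross-checks the row count:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class InspectMriCsv {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("InspectMriCsv")
                    .getOrCreate();

            // Spark decompresses the gzipped csv part files transparently,
            // so there is no need to gunzip them into a separate HDFS dir first.
            Dataset<Row> df = spark.read()
                    .format("csv")
                    .option("header", false)
                    .load("hdfs:///user/vagrant/cc-mri-csv/part*");

            System.out.println("rows: " + df.count()); // compare against the wc -l figure above
            df.show(5, false);                          // peek at a few records
        }
    }
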
     44-----------------------------------------
     45Running export_mri_subset.sh
     46-----------------------------------------
     47
     48The export_mri_subset.sh script is set up to run on the csv input file produced by running export_mri_index_csv.sh
     49
     50Running this initially produced the following exception:
     51
     52
     532019-08-29 05:48:52 INFO  CCIndexExport:152 - Number of records/rows matched by query: 345624
     542019-08-29 05:48:52 INFO  CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file)
     552019-08-29 05:48:52 INFO  CCIndexExport:165 - Repartitioning data to 70 output partitions
     56Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];;
     57'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length]
     58+- AnalysisBarrier
     59      +- Repartition 70, true
     60         +- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv
     61
     62    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
     63    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
     64    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
     65    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
     66    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
     67    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
     68    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
     69    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
     70    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
     71    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
     72    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
     73    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
     74    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
     75    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
     76    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
     77    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
     78    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
     79    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
     80    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
     81    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
     82    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
     83    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
     84    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
     85    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
     86    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
     87    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
     88    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
     89    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
     90    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
     91    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
     92    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
     93    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
     94    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
     95    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
     96    at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
     97    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
     98    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325)
     99    at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169)
     100    at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192)
     101    at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214)
     102    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     103    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
     104    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     105    at java.lang.reflect.Method.invoke(Method.java:498)
     106    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
     107    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
     108    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
     109    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
     110    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
     111    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
     1122019-08-29 05:48:52 INFO  SparkContext:54 - Invoking stop() from shutdown hook
     113
     114
     115
     116Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
     117The actual solution is to edit CCIndexWarcExport.java as follows:
     1181. Set option("header") to false, since the csv file contains no header row, only data rows.
     1192. With no header, the 4 column names are inferred as _c0 to _c3 rather than as url/warc_filename etc., so the select() must use those inferred names.
     120
     121emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java
     122
     123Change:
     124    sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true)
     125                                        .load(csvQueryResult);
     126To
     127    sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true)
     128                                        .load(csvQueryResult);
     129
     130And comment out:
     131    //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd()
     132    //                            .toJavaRDD();
     133Replace with the default inferred column names:
     134    JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd()
     135                                .toJavaRDD();
     136
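An alternative, untested approach would be to keep option("header") as false but give the csv reader an explicit schema, so that the original column names still resolve and the select() line needs no change. A rough sketch against the same variables (sqlDF, sparkSession, csvQueryResult) in CCIndexWarcExport.java:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Name the four columns explicitly instead of relying on the inferred _c0.._c3
    StructType csvSchema = new StructType()
            .add("url", DataTypes.StringType)
            .add("warc_filename", DataTypes.StringType)
            .add("warc_record_offset", DataTypes.LongType)
            .add("warc_record_length", DataTypes.IntegerType);

    sqlDF = sparkSession.read().format("csv")
            .option("header", false)
            .schema(csvSchema)
            .load(csvQueryResult);

    // With named columns, the original select() can stay as it was:
    JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length")
            .rdd().toJavaRDD();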
     137
     138Now recompile:
     139    mvn package
     140
     141And run:
     142    ./src/script/export_mri_subset.sh
     143
    12144-------------------------
    13 
    14145
    15146WET example from https://github.com/commoncrawl/cc-warc-examples
     
    75206In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)."
    76207
     208----
     209CMDS
     210----
     211https://stackoverflow.com/questions/29565716/spark-kill-running-application
     212
    77213=========================================================
    78214Configuring spark to work on Amazon AWS s3a dataset:
     
    122258
    123259
    124 But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
     260[If the access key and secret key were specified in hadoop's core-site.xml and not in the spark conf properties file, then running export_maori_index_csv.sh produced the following error:
     261
     2622019-08-29 06:16:38 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
     2632019-08-29 06:16:40 WARN  FileStreamSink:66 - Error while looking for metadata directory.
     264Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
     265    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
     266    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
     267    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
     268    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
     269    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
     270]
     271
     272Instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml)
    125273
    126274you'll want to put the Amazon AWS access key and secret key in the spark properties file:
     
    129277
    130278
    131 The spark properties should contain:
     279The spark properties conf file above should contain:
    132280
    133281spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem 
    134 spark.hadoop.fs.s3a.access.key=ACCESSKEY 
    135 spark.hadoop.fs.s3a.secret.key=SECRETKEY 
    136 
    137 
    138 
    139 When the job is running, can visit the Spark Context at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me, since I forwarded the vagrant VM's ports at +1)
     282spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY 
     283spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY 
     284
     285
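The same s3a settings can also be supplied programmatically (or as --conf arguments to spark-submit) rather than through the properties file, since any spark.hadoop.* property is copied into the Hadoop configuration that s3a reads. A minimal, untested Java sketch with placeholder key values:

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;

    // Placeholder credentials: substitute the real IAM access key and secret key.
    SparkConf conf = new SparkConf()
            .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
            .set("spark.hadoop.fs.s3a.access.key", "ACCESSKEY")
            .set("spark.hadoop.fs.s3a.secret.key", "SECRETKEY");

    SparkSession spark = SparkSession.builder()
            .config(conf)
            .appName("CCIndexExport")
            .getOrCreate();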
     286
     287When the job is running, the Spark context web UI can be visited at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me the first time, since I forwarded the vagrant VM's ports at +1; on subsequent runs, however, it appeared to be at node1:4040/jobs).
    140288
    141289-------------