Changeset 33446

Timestamp:
29.08.2019 19:12:39
Author:
ak19
Message:

1. Committing a working version of export_maori_subset.sh, which takes the csv file produced by running export_maori_index_csv.sh as input and fetches the WARC records at the specified offsets.
2. Notes on the changes necessary to the Java code (cc-index-table/src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java) to get export_maori_subset.sh to run without exceptions so far.
3. The otherwise untested export_maori_subset_from_scratch.sh script, which would perform the sql query and feed the results directly into retrieving the WARC records, instead of producing an intermediate csv file.

Location:
gs3-extensions/maori-lang-detection
Files:
2 added
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33443 r33446  
    1010- If Firefox is launched inside the VM (i.e. inside node1), then pages can be accessed on their respective ports at any of localhost|10.211.55.101|node1. 
    1111 
     12----------------------------------- 
     13 VIEW THE MRI-ONLY INDEX GENERATED 
     14----------------------------------- 
     15hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | tail -5 
     16 
     17(The part* files are a gzipped archive, i.e. binary, so the output of the cat above is not directly readable; decompress as below.) 
     18 
     19vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -mkdir hdfs:///user/vagrant/cc-mri-unzipped-csv 
     20 
     21# https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop 
     22XXX vagrant@node1:~/cc-index-table/src/script$ hadoop fs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hadoop fs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv 
     23 
     24 
     25vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | gzip -d | hdfs dfs -put - hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv 
     26vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -ls hdfs:///user/vagrant/cc-mri-unzipped-csv 
     27Found 1 items 
     28-rw-r--r--   1 vagrant supergroup   71664603 2019-08-29 04:47 hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv 
     29 
     30# https://stackoverflow.com/questions/14925323/view-contents-of-file-in-hdfs-hadoop 
     31vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | tail -5 
     32 
     33# url, warc_filename, warc_record_offset, warc_record_length 
     34http://paupauocean.com/page91?product_id=142&brd=1,crawl-data/CC-MAIN-2019-30/segments/1563195526940.0/warc/CC-MAIN-20190721082354-20190721104354-00088.warc.gz,115081770,21404 
     35https://cookinseln-reisen.de/cook-inseln/rarotonga/,crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz,343512295,12444 
     36http://www.halopharm.com/mi/profile/,crawl-data/CC-MAIN-2019-30/segments/1563195525500.21/warc/CC-MAIN-20190718042531-20190718064531-00093.warc.gz,219160333,10311 
     37https://www.firstpeople.us/pictures/green/Touched-by-the-hand-of-Time-1907.html,crawl-data/CC-MAIN-2019-30/segments/1563195526670.1/warc/CC-MAIN-20190720194009-20190720220009-00362.warc.gz,696195242,5408 
     38https://www.sos-accessoire.com/programmateur-programmateur-module-electronique-whirlpool-481231028062-27573.html,crawl-data/CC-MAIN-2019-30/segments/1563195527048.80/warc/CC-MAIN-20190721144008-20190721170008-00164.warc.gz,830087190,26321 
     39 
     40# https://stackoverflow.com/questions/32612867/how-to-count-lines-in-a-file-on-hdfs-command 
     41vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l 
     42345625 
     43 
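As the column comment above indicates, each data row gives a url plus the warc_filename, warc_record_offset and warc_record_length of its record. As an aside, a single record can be pulled out of the crawl directly with an HTTP byte-range request over the offset/length pair against the public commoncrawl S3 bucket. Below is a minimal, untested sketch (not part of the scripts here); the class name FetchWarcRecord is hypothetical, and the offset and length values are copied from the cookinseln-reisen.de row above.

    // FetchWarcRecord.java -- hypothetical standalone example, not in the repository
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class FetchWarcRecord {
        public static void main(String[] args) throws IOException {
            // Values taken from one data row of cc-mri.csv above
            String warcFilename = "crawl-data/CC-MAIN-2019-30/segments/1563195526799.4/warc/CC-MAIN-20190720235054-20190721021054-00289.warc.gz";
            long offset = 343512295L;  // warc_record_offset
            long length = 12444L;      // warc_record_length

            // Each record in a Common Crawl .warc.gz is its own gzip member, so
            // requesting just these bytes and gunzipping them yields one record.
            URL url = new URL("https://commoncrawl.s3.amazonaws.com/" + warcFilename);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Range", "bytes=" + offset + "-" + (offset + length - 1));

            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(conn.getInputStream()), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // WARC headers, HTTP headers, then page body
                }
            }
        }
    }
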
     44----------------------------------------- 
     45Running export_mri_subset.sh 
     46----------------------------------------- 
     47 
     48The export_mri_subset.sh script is set up to run on the csv input file produced by running export_mri_index_csv.sh. 
     49 
     50Running this initially produced the following exception: 
     51 
     52 
     532019-08-29 05:48:52 INFO  CCIndexExport:152 - Number of records/rows matched by query: 345624 
     542019-08-29 05:48:52 INFO  CCIndexExport:157 - Distributing 345624 records to 70 output partitions (max. 5000 records per WARC file) 
     552019-08-29 05:48:52 INFO  CCIndexExport:165 - Repartitioning data to 70 output partitions 
     56Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`url`' given input columns: [http://176.31.110.213:600/?p=287, crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz, 1215489, 15675];; 
     57'Project ['url, 'warc_filename, 'warc_record_offset, 'warc_record_length] 
     58+- AnalysisBarrier 
     59      +- Repartition 70, true 
     60         +- Relation[http://176.31.110.213:600/?p=287#10,crawl-data/CC-MAIN-2019-30/segments/1563195527531.84/warc/CC-MAIN-20190722051628-20190722073628-00547.warc.gz#11,1215489#12,15675#13] csv 
     61 
     62    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) 
     63    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88) 
     64    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) 
     65    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) 
     66    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) 
     67    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) 
     68    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) 
     69    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) 
     70    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) 
     71    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) 
     72    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) 
     73    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) 
     74    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) 
     75    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) 
     76    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
     77    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
     78    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) 
     79    at scala.collection.AbstractTraversable.map(Traversable.scala:104) 
     80    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120) 
     81    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125) 
     82    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) 
     83    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125) 
     84    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95) 
     85    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85) 
     86    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) 
     87    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) 
     88    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) 
     89    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) 
     90    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) 
     91    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) 
     92    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) 
     93    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) 
     94    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) 
     95    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295) 
     96    at org.apache.spark.sql.Dataset.select(Dataset.scala:1307) 
     97    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325) 
     98    at org.apache.spark.sql.Dataset.select(Dataset.scala:1325) 
     99    at org.commoncrawl.spark.examples.CCIndexWarcExport.run(CCIndexWarcExport.java:169) 
     100    at org.commoncrawl.spark.examples.CCIndexExport.run(CCIndexExport.java:192) 
     101    at org.commoncrawl.spark.examples.CCIndexWarcExport.main(CCIndexWarcExport.java:214) 
     102    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     103    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     104    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     105    at java.lang.reflect.Method.invoke(Method.java:498) 
     106    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
     107    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) 
     108    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) 
     109    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) 
     110    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) 
     111    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
     1122019-08-29 05:48:52 INFO  SparkContext:54 - Invoking stop() from shutdown hook 
     113 
     114 
     115 
     116Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers 
     117The actual solution is to edit CCIndexWarcExport.java as follows: 
     1181. Set option("header") to false, since the csv file contains no header row, only data rows. 
     1192. Select the 4 columns by their inferred names _c0 to _c3, since without a header row they are not named url/warc_filename etc. 
     120 
     121emacs src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java 
     122 
     123Change: 
     124    sqlDF = sparkSession.read().format("csv").option("header", true).option("inferSchema", true) 
     125                                        .load(csvQueryResult); 
     126To 
     127    sqlDF = sparkSession.read().format("csv").option("header", false).option("inferSchema", true) 
     128                                        .load(csvQueryResult); 
     129 
     130And comment out: 
     131    //JavaRDD<Row> rdd = sqlDF.select("url", "warc_filename", "warc_record_offset", "warc_record_length").rdd() 
     132    //                            .toJavaRDD(); 
     133Replace with the default inferred column names: 
     134    JavaRDD<Row> rdd = sqlDF.select("_c0", "_c1", "_c2", "_c3").rdd() 
     135                                .toJavaRDD(); 
     136 
     137 
     138Now recompile: 
     139    mvn package 
     140 
     141And run: 
     142    ./src/script/export_mri_subset.sh 
     143 
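An alternative tweak (just a sketch, not tested here) would be to rename the inferred columns straight after loading the csv, so that the original select("url", "warc_filename", "warc_record_offset", "warc_record_length") line could stay untouched; Spark's Dataset.toDF(String...) does the renaming:

    // Hypothetical alternative inside CCIndexWarcExport.run(), reusing the method's
    // existing sparkSession and csvQueryResult variables:
    sqlDF = sparkSession.read().format("csv")
                .option("header", false).option("inferSchema", true)
                .load(csvQueryResult)
                .toDF("url", "warc_filename", "warc_record_offset", "warc_record_length");
    // The unmodified select(...) call then resolves against these renamed columns.
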
    12144------------------------- 
    13  
    14145 
    15146WET example from https://github.com/commoncrawl/cc-warc-examples 
     
    75206In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)." 
    76207 
     208---- 
     209CMDS 
     210---- 
     211https://stackoverflow.com/questions/29565716/spark-kill-running-application 
     212 
    77213========================================================= 
    78214Configuring spark to work on Amazon AWS s3a dataset: 
     
    122258 
    123259 
    124 But instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml) 
      260[If the access key and secret key were specified in hadoop's core-site.xml and not in the spark conf properties file, then running export_maori_index_csv.sh produced the following error: 
     261 
     2622019-08-29 06:16:38 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint 
     2632019-08-29 06:16:40 WARN  FileStreamSink:66 - Error while looking for metadata directory. 
     264Exception in thread "main" com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain 
     265    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117) 
     266    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521) 
     267    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031) 
     268    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994) 
     269    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297) 
     270] 
     271 
     272Instead of putting the access and secret keys in hadoop's core-site.xml as above (with sudo emacs /usr/local/hadoop-2.7.6/etc/hadoop/core-site.xml) 
    125273 
    126274you'll want to put the Amazon AWS access key and secret key in the spark properties file: 
     
    129277 
    130278 
    131 The spark properties should contain: 
     279The spark properties conf file above should contain: 
    132280 
    133281spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem   
    134 spark.hadoop.fs.s3a.access.key=ACCESSKEY   
    135 spark.hadoop.fs.s3a.secret.key=SECRETKEY   
    136  
    137  
    138  
    139 When the job is running, can visit the Spark Context at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me, since I forwarded the vagrant VM's ports at +1) 
     282spark.hadoop.fs.s3a.access.key=PASTE_IAM-ROLE_ACCESSKEY   
     283spark.hadoop.fs.s3a.secret.key=PASTE_IAM-ROLE_SECRETKEY   
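The same three s3a settings can presumably also be set programmatically when the SparkSession is built, instead of via the conf file; a minimal untested sketch of that variant (placeholder key values, as in the conf file above):

    // Hypothetical programmatic equivalent of the spark properties above,
    // using org.apache.spark.sql.SparkSession:
    SparkSession spark = SparkSession.builder()
            .appName("CCIndexWarcExport")
            .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
            .config("spark.hadoop.fs.s3a.access.key", "PASTE_IAM-ROLE_ACCESSKEY")
            .config("spark.hadoop.fs.s3a.secret.key", "PASTE_IAM-ROLE_SECRETKEY")
            .getOrCreate();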
     284 
     285 
     286 
      287When the job is running, the Spark context web UI can be visited at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me the first time, since I forwarded the vagrant VM's ports at +1; on subsequent runs, however, it appeared to be at node1:4040/jobs). 
    140288 
    141289-------------