Changeset 33443

Timestamp:
28.08.2019 20:22:34 (3 weeks ago)
Author:
ak19
Message:

More notes

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33441 r33443  


----
TIME
----
1. https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
http://digitalpebble.blogspot.com/2017/03/need-billions-of-web-pages-dont-bother_29.html

"So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.

This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts."
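The anchor-text shortcut in the quote works because a WAT record's payload is plain JSON with the page's outlinks already extracted. A minimal sketch of pulling anchor text out of one such payload with only the standard library — the nested field layout (Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata -> Links) follows CommonCrawl's WAT format, but the sample payload itself is invented for illustration:

```python
import json

# Invented sample of the JSON payload carried by a single WAT record.
sample_wat_payload = """{
  "Envelope": {
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://example.com/a", "text": "Example link"},
            {"path": "IMG@/src", "url": "http://example.com/img.png"}
          ]
        }
      }
    }
  }
}"""

def anchor_texts(wat_json):
    """Return the anchor text of every <a href=...> link in one WAT record."""
    record = json.loads(wat_json)
    links = (record.get("Envelope", {})
                   .get("Payload-Metadata", {})
                   .get("HTTP-Response-Metadata", {})
                   .get("HTML-Metadata", {})
                   .get("Links", []))
    # Only A@/href entries are anchors; images, scripts etc. use other paths.
    return [l["text"] for l in links if l.get("path") == "A@/href" and "text" in l]

print(anchor_texts(sample_wat_payload))  # ['Example link']
```

No HTML parsing at all, which is exactly where the 100-minutes-to-17-minutes gain in the quote comes from.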
     69 
     702. https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands 
     71"Both the parsing part and the processing part take just a couple of minutes per index file / WET file - the bulk of the “compute” lies within actually downloading these files. 
     72 
     73Essentially if you have some time to spare and an unlimited Internet connection, all of this processing can be done on one powerful machine. You can be fancy and go ahead and rent some Amazon server(s) to minimize the download time, but that can be costly. 
     74 
     75In my experience - parsing the whole index for Russian websites (just filtering by language) takes approximately 140 hours - but the majority of this time is just downloading (my speed averaged ~300-500 kb/s)." 
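The "filtering by language" step above is cheap because each line of a CommonCrawl CDXJ index file is just "surt-key timestamp json-blob", and recent indexes carry a "languages" field. A sketch of that filter, assuming this line layout — the sample index lines are invented:

```python
import json

# Invented sample CDXJ index lines: "<surt key> <timestamp> <json>".
sample_index_lines = [
    'ru,example)/ 20190815000000 {"url": "http://example.ru/", "languages": "rus", "status": "200"}',
    'com,example)/ 20190815000001 {"url": "http://example.com/", "languages": "eng", "status": "200"}',
]

def urls_for_language(index_lines, lang_code):
    """Yield URLs whose index record lists the given ISO 639-3 language code."""
    for line in index_lines:
        # The JSON blob starts after the second space-separated field.
        _, _, blob = line.split(" ", 2)
        record = json.loads(blob)
        # "languages" can list several comma-separated codes per page.
        if lang_code in record.get("languages", "").split(","):
            yield record["url"]

print(list(urls_for_language(sample_index_lines, "rus")))  # ['http://example.ru/']
```

The same pattern would apply to "mri" for Māori pages; as the quote says, downloading the index files dominates the runtime, not this parsing.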

=========================================================
Configuring Spark to work with an Amazon AWS s3a dataset:
     

https://stackoverflow.com/questions/43759896/spark-truncated-the-string-representation-of-a-plan-since-it-was-too-large-w


https://sparkour.urizone.net/recipes/using-s3/
Configuring Spark to Use Amazon S3
"Some Spark tutorials show AWS access keys hardcoded into the file paths. This is a horribly insecure approach and should never be done. Use exported environment variables or IAM Roles instead, as described in Configuring Amazon S3 as a Spark Data Source."
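One way to follow that advice is to export the standard AWS environment variables in the shell before launching Spark, instead of embedding keys in paths or scripts (a sketch; the values below are placeholders, not real keys):

```
export AWS_ACCESS_KEY_ID=PLACEHOLDER_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=PLACEHOLDER_SECRET_KEY
```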

"No FileSystem for scheme: s3n

java.io.IOException: No FileSystem for scheme: s3n

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also work on the spark-submit script."
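For example, the S3 filesystem classes live in the hadoop-aws artifact on Maven Central, so a launch along these lines pulls them in (a sketch; the version should match your Hadoop build, and my_job.py is a hypothetical script name):

```
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.7 my_job.py
```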

===========================================
     
spark.hadoop.fs.s3a.secret.key=SECRETKEY


When the job is running, you can visit the Spark context UI at http://node1:4040/jobs/ (http://node1:4041/jobs/ for me, since I forwarded the vagrant VM's ports at +1)
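A Vagrantfile fragment matching that "+1" port forwarding might look like this (a sketch, not the actual Vagrantfile used here):

```
Vagrant.configure("2") do |config|
  # Guest port 4040 (Spark UI) reachable on the host at 4041.
  config.vm.network "forwarded_port", guest: 4040, host: 4041
end
```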

-------------