Changeset 33423

Timestamp: 15.08.2019 20:07:04
Author: ak19
Message: Adding in the link to the vagrant VM with Hadoop, Spark for cluster machines (not standalone) that Dr Bainbridge found, as well as links to getting started with vagrant and basic cmds
Files: 1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

--- r33422
+++ r33423
@@ -1,2 +1,12 @@
+Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
+
+https://github.com/martinprobson/vagrant-hadoop-hive-spark
+
+Vagrant:
+    * Guide: https://www.vagrantup.com/intro/getting-started/index.html
+    * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
+------------
+
+At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says
 "The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from
 
@@ -9,5 +19,5 @@
 Mixnode turns the web into a database that you can run queries against. Say goodbye to web crawling, forget about web scraping, never run a spider again: get all the web data that you need using simple SQL queries."
 
-
+--------------
 https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
 http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/