Changeset 33423 for gs3-extensions


Timestamp: 2019-08-15T20:07:04+12:00
Author: ak19
Message: Adding in the link to the Vagrant VM with Hadoop and Spark for cluster machines (not standalone) that Dr Bainbridge found, as well as links to getting started with Vagrant and its basic commands.

File: 1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    --- gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (r33422)
    +++ gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (r33423)
    @@ -1,2 +1,12 @@
    +Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
    +
    +https://github.com/martinprobson/vagrant-hadoop-hive-spark
    +
    +Vagrant:
    +    * Guide: https://www.vagrantup.com/intro/getting-started/index.html
    +    * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
    +------------
    +
    +At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says
     "The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from
     
    @@ -9,5 +19,5 @@
     Mixnode turns the web into a database that you can run queries against. Say goodbye to web crawling, forget about web scraping, never run a spider again: get all the web data that you need using simple SQL queries."
     
    -
    +--------------
     https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
     http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/