- Timestamp: 2019-08-15T20:07:04+12:00
- File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33422 r33423
+Dr Bainbridge found the following vagrant file that will set up hadoop and spark presumably for cluster computing:
+
+https://github.com/martinprobson/vagrant-hadoop-hive-spark
+
+Vagrant:
+* Guide: https://www.vagrantup.com/intro/getting-started/index.html
+* Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
+------------
+
+At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says
 "The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from
 
…
 Mixnode turns the web into a database that you can run queries against. Say goodbye to web crawling, forget about web scraping, never run a spider again: get all the web data that you need using simple SQL queries."
 
 --------------
 https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
 http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/