Changeset 33422

Timestamp:
15.08.2019 17:52:19
Author:
ak19
Message:

Some more links.

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

--- r33419
+++ r33422
@@ -1,2 +1,13 @@
+"The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from
+
+    the continued seed donation of URLs from mixnode.com
+    ..."
+
+https://www.mixnode.com/
+"The entire web, in your hands
+
+Mixnode turns the web into a database that you can run queries against. Say goodbye to web crawling, forget about web scraping, never run a spider again: get all the web data that you need using simple SQL queries."
+
+
 https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
 http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/
@@ -19,4 +30,7 @@
 
 http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
+
+SPARK (Spark SQL): https://github.com/commoncrawl/cc-index-table
+    with example on selecting languages
 
 ---
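
The cc-index-table link added above mentions an example on selecting languages. As a rough sketch of what such a selection might look like for Māori, the helper below only builds a Spark SQL query string; it does not run Spark. It assumes the columnar index is registered as a table named ccindex with the columns url, content_languages, warc_filename, warc_record_offset, warc_record_length, crawl and subset, and that content_languages holds comma-separated ISO 639-3 codes ("mri" for Māori). The function name build_language_query and the crawl ID used in the usage line are illustrative assumptions, not taken from the changeset.

```python
import re


def build_language_query(lang_code: str, crawl: str, limit: int = 100) -> str:
    """Build a Spark SQL query for pages whose detected languages include lang_code.

    Hypothetical helper: the table name (ccindex) and column names are
    assumptions about the Common Crawl columnar index schema.
    """
    # Only accept a three-letter alphabetic code, e.g. "mri" for Maori.
    if not (len(lang_code) == 3 and lang_code.isalpha()):
        raise ValueError("expected a three-letter ISO 639-3 code")
    # Crawl IDs are assumed to look like CC-MAIN-YYYY-WW.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError("expected a crawl ID such as CC-MAIN-2019-35")
    if limit <= 0:
        raise ValueError("limit must be positive")

    return (
        "SELECT url, content_languages, warc_filename, "
        "warc_record_offset, warc_record_length\n"
        "FROM ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        # LIKE matches the code anywhere in the comma-separated list.
        f"  AND content_languages LIKE '%{lang_code.lower()}%'\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Example crawl ID; substitute whichever crawl is being queried.
    print(build_language_query("mri", "CC-MAIN-2019-35"))
```

The resulting string could then be passed to spark.sql(...) in a session where the index table has been loaded. Filtering on crawl and subset first keeps the scan to a single partition of the index, which matters because the full table is large.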