Changeset 33422


Timestamp: 2019-08-15T17:52:19+12:00
Author: ak19
Message: Some more links.
File: 1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

--- CommonCrawl.txt (r33419)
+++ CommonCrawl.txt (r33422)
@@ -1,2 +1,13 @@
+"The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from
+
+    the continued seed donation of URLs from mixnode.com
+    ..."
+
+https://www.mixnode.com/
+"The entire web, in your hands
+
+Mixnode turns the web into a database that you can run queries against. Say goodbye to web crawling, forget about web scraping, never run a spider again: get all the web data that you need using simple SQL queries."
+
+
 https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
 http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/
@@ -19,4 +30,7 @@
 
 http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
+
+SPARK (Spark SQL): https://github.com/commoncrawl/cc-index-table
+    with example on selecting languages
 
 ---