Timestamp: 2020-03-09T17:34:10+13:00
Author: ak19
Message:

Common crawl 12 month urls and CC provided stats

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r33986 r33999  
    The 12 month period CommonCrawl crawl data that we used:

    https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
    - contains 2.8 billion web pages and 220 TiB of uncompressed content
    - contains 500 million new URLs, not contained in any crawl archive before
    https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
    - 3.0 billion web pages and 240 TiB of uncompressed content
    - 600 million new URLs, not contained in any crawl archive before
    https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
    - 2.6 billion web pages or 220 TiB of uncompressed content
    - 640 million new URLs, not contained in any crawl archive before
    https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
    - 3.1 billion web pages or 250 TiB of uncompressed content,
    - 735 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
    - 2.85 billion web pages or 240 TiB of uncompressed content
    - 850 million URLs not contained in any crawl archive before.
    https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
    - 2.9 billion web pages or 225 TiB of uncompressed content
    - 750 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
    - 2.55 billion web pages or 210 TiB of uncompressed content
    - 660 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
    - 2.5 billion web pages or 198 TiB of uncompressed content
    - 750 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
    - 2.65 billion web pages or 220 TiB of uncompressed content
    - 825 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
    - 2.6 billion web pages or 220 TiB of uncompressed content
    - 880 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
    - 2.6 billion web pages or 220 TiB of uncompressed content
    - 810 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
    - 2.95 billion web pages or 260 TiB of uncompressed content
    - 1.1 billion URLs not contained in any crawl archive before

    ---------------------------------------------

    "UPPER BOUND"

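The per-crawl figures listed above can be totalled over the 12 month period with a short script. This is only a sketch: the tuple layout, the `CRAWLS` name and the `totals` helper are hypothetical and not part of the changeset. Page and URL counts are stored in millions so the arithmetic stays in integers. Note that the page total counts a page once per crawl it appears in, so it overstates the number of distinct pages; whether that is what the "UPPER BOUND" heading refers to is an assumption, since the changeset does not say.

```python
# Figures copied from the CommonCrawl announcements listed in piechart_data.txt.
# Layout (hypothetical): (crawl month, web pages in millions,
#                         uncompressed TiB, new URLs in millions)
CRAWLS = [
    ("2018-09", 2800, 220, 500),
    ("2018-10", 3000, 240, 600),
    ("2018-11", 2600, 220, 640),
    ("2018-12", 3100, 250, 735),
    ("2019-01", 2850, 240, 850),
    ("2019-02", 2900, 225, 750),
    ("2019-03", 2550, 210, 660),
    ("2019-04", 2500, 198, 750),
    ("2019-05", 2650, 220, 825),
    ("2019-06", 2600, 220, 880),
    ("2019-07", 2600, 220, 810),
    ("2019-08", 2950, 260, 1100),
]


def totals(crawls):
    """Return (pages in millions, TiB, new URLs in millions) summed over all crawls."""
    pages = sum(c[1] for c in crawls)
    tib = sum(c[2] for c in crawls)
    new_urls = sum(c[3] for c in crawls)
    return pages, tib, new_urls


if __name__ == "__main__":
    pages, tib, new_urls = totals(CRAWLS)
    print(f"web pages (recrawls counted again): {pages / 1000:.1f} billion")
    print(f"uncompressed content:               {tib} TiB")
    print(f"previously unseen URLs:             {new_urls / 1000:.1f} billion")
```

With the figures as listed, the sums come to 33.1 billion web pages, 2723 TiB of uncompressed content and 9.1 billion previously unseen URLs.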