Changeset 33999

Timestamp: 09.03.2020 17:34:10
Author: ak19
Message: Common crawl 12 month urls and CC provided stats

Files:
1 modified

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r33986 r33999  
    The 12 month period CommonCrawl crawl data that we used:

    https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
    - contains 2.8 billion web pages and 220 TiB of uncompressed content
    - contains 500 million new URLs, not contained in any crawl archive before
    https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
    - 3.0 billion web pages and 240 TiB of uncompressed content
    - 600 million new URLs, not contained in any crawl archive before
    https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
    - 2.6 billion web pages or 220 TiB of uncompressed content
    - 640 million new URLs, not contained in any crawl archive before
    https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
    - 3.1 billion web pages or 250 TiB of uncompressed content,
    - 735 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
    - 2.85 billion web pages or 240 TiB of uncompressed content
    - 850 million URLs not contained in any crawl archive before.
    https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
    - 2.9 billion web pages or 225 TiB of uncompressed content
    - 750 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
    - 2.55 billion web pages or 210 TiB of uncompressed content
    - 660 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
    - 2.5 billion web pages or 198 TiB of uncompressed content
    - 750 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
    - 2.65 billion web pages or 220 TiB of uncompressed content
    - 825 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
    - 2.6 billion web pages or 220 TiB of uncompressed content
    - 880 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
    - 2.6 billion web pages or 220 TiB of uncompressed content
    - 810 million URLs not contained in any crawl archive before
    https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
    - 2.95 billion web pages or 260 TiB of uncompressed content
    - 1.1 billion URLs not contained in any crawl archive before

    ---------------------------------------------

    "UPPER BOUND"
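
As a rough cross-check of the monthly figures added in this changeset, the sketch below tallies them in Python. The names `CRAWL_NOTES`, `tally`, `PAGES_RE` and `URLS_RE` are illustrative and not part of the changeset. Each crawl reports only some of its URLs as new, so the crawls overlap and the page total counts captures across crawls rather than distinct pages.

```python
import re
from decimal import Decimal

# Stat lines copied from the added text above, two lines per monthly crawl.
CRAWL_NOTES = """\
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
- 3.1 billion web pages or 250 TiB of uncompressed content,
- 735 million URLs not contained in any crawl archive before
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before.
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before
"""

# Decimal keeps values such as 2.85 billion exact when summed.
UNIT = {"million": Decimal("1e6"), "billion": Decimal("1e9")}

# The notes use both "and" and "or" between the page count and the size.
PAGES_RE = re.compile(r"([\d.]+) (million|billion) web pages (?:and|or) (\d+) TiB")
# Some lines say "new URLs", others just "URLs".
URLS_RE = re.compile(r"([\d.]+) (million|billion) (?:new )?URLs")


def tally(text):
    """Return (total pages, total TiB, total previously unseen URLs)."""
    pages = Decimal(0)
    tib = 0
    new_urls = Decimal(0)
    for line in text.splitlines():
        match = PAGES_RE.search(line)
        if match:
            pages += Decimal(match.group(1)) * UNIT[match.group(2)]
            tib += int(match.group(3))
            continue
        match = URLS_RE.search(line)
        if match:
            new_urls += Decimal(match.group(1)) * UNIT[match.group(2)]
    return int(pages), tib, int(new_urls)


print(tally(CRAWL_NOTES))  # (33100000000, 2723, 9100000000)
```

Over the twelve crawls this gives about 33.1 billion page captures, 2723 TiB of uncompressed content, and 9.1 billion URLs reported as new.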