Timestamp: 2020-03-09T17:34:10+13:00
Files: 1 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
r33986 r33999

The 12 month period CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content,
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before.
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

---------------------------------------------
1
"UPPER BOUND" 2
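
For reference, the twelve monthly figures listed above can be added up with a short script. This is only a sketch: the tuple layout and the names CRAWLS and totals are made up for illustration and are not part of the project. Page and URL counts are kept in millions so the sums stay in integer arithmetic. Note that the summed page count includes pages that were fetched again in later crawls, so it can be no more than an upper limit on the number of distinct pages.

```python
# (crawl, web pages in millions, uncompressed TiB, new URLs in millions)
# Figures are copied from the CommonCrawl announcements listed above.
CRAWLS = [
    ("2018-09", 2800, 220, 500),
    ("2018-10", 3000, 240, 600),
    ("2018-11", 2600, 220, 640),
    ("2018-12", 3100, 250, 735),
    ("2019-01", 2850, 240, 850),
    ("2019-02", 2900, 225, 750),
    ("2019-03", 2550, 210, 660),
    ("2019-04", 2500, 198, 750),
    ("2019-05", 2650, 220, 825),
    ("2019-06", 2600, 220, 880),
    ("2019-07", 2600, 220, 810),
    ("2019-08", 2950, 260, 1100),
]


def totals(crawls):
    """Return (pages in millions, TiB, new URLs in millions) summed over all crawls."""
    pages = sum(c[1] for c in crawls)
    tib = sum(c[2] for c in crawls)
    new_urls = sum(c[3] for c in crawls)
    return pages, tib, new_urls


if __name__ == "__main__":
    pages, tib, new_urls = totals(CRAWLS)
    print(f"web pages (with repeats): {pages / 1000:.1f} billion")
    print(f"uncompressed content:     {tib} TiB")
    print(f"previously unseen URLs:   {new_urls / 1000:.1f} billion")
```

With the figures as listed, this comes to 33.1 billion page fetches, 2723 TiB of uncompressed content, and 9.1 billion URLs not seen in any earlier crawl archive.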