Timestamp:
2020-03-09T18:56:00+13:00
Author:
ak19
Message:

Tentative total URLs from Common Crawl 12-month crawl data.

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

--- other-projects/maori-lang-detection/mongodb-data/piechart_data.txt (r33999)
+++ other-projects/maori-lang-detection/mongodb-data/piechart_data.txt (r34001)
@@ -38,4 +38,6 @@
 - 1.1 billion URLs not contained in any crawl archive before
 
+= 9100 million or 9.1 billion new URLs not contained in any crawl archive before
++ taking the first crawl month's figure of 2.8 billion - 500 million new URL in first crawl month = 11.4 billion URLs? At least?
 ---------------------------------------------
 
@@ -70,7 +72,7 @@
 [X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
 
-Line count above correct with the following: 23794+4485+47280=75559
-
-But instead of domain/unique domain/URL/basic unique URL counts. The union of:
+Line count above is correct and consistent with the following: 23794+4485+47280=75559
+
+But instead of domain/unique domain or URL/basic unique URL counts. The union of:
 - domains of the following: 1588+288+1462 = 3338
 - unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
@@ -121,4 +123,6 @@
 2a. DISCARDED URLS:
 URLS that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
+
+> wc -l discardURLs.txt
 23794
 
@@ -147,5 +151,5 @@
 
 
-e. After duplicates further pruned out from waht remained of keepURLs - the seedURLs for Nutch:
+e. After duplicates further pruned out from what remained of keepURLs - the seedURLs for Nutch:
 
 wc -l seedURLs.txt
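
The totals quoted in the changed lines can be re-derived with a few lines of Python. This is a minimal sketch for checking the arithmetic and is not part of the changeset. The figures are copied from piechart_data.txt at r34001. The variable names and the `total` helper are hypothetical, and the reading of the 11.4 billion estimate (the 12-month sum of new URLs, with the first month's full 2.8 billion substituted for its 500 million new URLs) is an assumption about what the note intends.

```python
# Sanity check of the tallies quoted in piechart_data.txt (r34001).
# Figures are copied from the diff; all names here are illustrative only.


def total(counts):
    """Sum the per-list counts held in a dict."""
    return sum(counts.values())


# Line counts of the three URL lists (e.g. from `wc -l discardURLs.txt`).
url_line_counts = {"discardURLs": 23794, "greylistedURLs": 4485, "keepURLs": 47280}

# Domain counts per list, and the same after stripping protocol and "www".
domain_counts = {"discardURLs": 1588, "greylistedURLs": 288, "keepURLs": 1462}
unique_basic_domain_counts = {"discardURLs": 1415, "greylistedURLs": 277, "keepURLs": 1362}

# Common Crawl estimate, in millions of URLs.
# Assumed reading: 9100 is the 12-month sum of "new" URLs, which already
# includes the first month's 500; the first month's full 2800 replaces it.
new_urls_all_months = 9100
first_month_all_urls = 2800
first_month_new_urls = 500
tentative_total_millions = new_urls_all_months + first_month_all_urls - first_month_new_urls

if __name__ == "__main__":
    print("URL lines:", total(url_line_counts))                        # 75559
    print("domains:", total(domain_counts))                            # 3338
    print("unique basic domains:", total(unique_basic_domain_counts))  # 3054
    print("tentative total (millions):", tentative_total_millions)     # 11400
```

The results match the figures stated in the file: 75559 URL lines, 3338 domains, 3054 unique basic domains, and 11.4 billion URLs as the tentative lower bound.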