Changeset 34001

Timestamp:
09.03.2020 18:56:00 (3 weeks ago)
Author:
ak19
Message:

Tentative total URLs from the Common Crawl 12-month crawl data.

Files:
1 modified

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r33999 r34001  
     38  38  - 1.1 billion URLs not contained in any crawl archive before
     39  39
         40  = 9100 million or 9.1 billion new URLs not contained in any crawl archive before
         41  + taking the first crawl month's figure of 2.8 billion - 500 million new URL in first crawl month = 11.4 billion URLs? At least?
     40  42  ---------------------------------------------
     41  43
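The tentative total added in r34001 can be checked with a short calculation. This is a sketch of one reading of the note, not necessarily the author's method: the 9.1 billion URLs that were new across the 12 crawl months are added to the part of the first month's 2.8 billion URLs that was not new in that month (2.8 billion minus 500 million). The function and constant names are illustrative, and all figures are in millions of URLs.

```python
# Figures quoted in the notes, expressed in millions of URLs.
NEW_URLS_ALL_MONTHS = 9100   # URLs not in any earlier crawl archive, summed over 12 months
FIRST_MONTH_TOTAL = 2800     # total URLs in the first crawl month
FIRST_MONTH_NEW = 500        # of those, URLs that were new in the first month


def estimate_total_urls(new_all_months, first_month_total, first_month_new):
    """Lower-bound estimate: all new URLs plus the first month's already-known URLs."""
    already_known_in_first_month = first_month_total - first_month_new
    return new_all_months + already_known_in_first_month


total_millions = estimate_total_urls(
    NEW_URLS_ALL_MONTHS, FIRST_MONTH_TOTAL, FIRST_MONTH_NEW
)
print(f"{total_millions} million URLs")
```

Under this reading the result is 11400 million, which matches the 11.4 billion in the note. It is a lower bound only if every URL counted as new is distinct across months, which the note itself flags as uncertain ("At least?").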
     
     70  72  [X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
     71  73
     72      Line count above correct with the following: 23794+4485+47280=75559
     73
     74      But instead of domain/unique domain/URL/basic unique URL counts. The union of:
         74  Line count above is correct and consistent with the following: 23794+4485+47280=75559
         75
         76  But instead of domain/unique domain or URL/basic unique URL counts. The union of:
     75  77  - domains of the following: 1588+288+1462 = 3338
     76  78  - unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
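The three sums quoted in this hunk can be verified with a few lines of Python. This is a sketch only: the 23794 figure is tied to discardURLs by the `wc -l` output elsewhere in the file, while the assignment of the remaining numbers to greylistedURLs and keepURLs is an assumption based on the order in which the lists are named. The dictionary names are illustrative.

```python
# Per-list counts as quoted in the notes.
# Assumption: each sum lists discardURLs, greylistedURLs, keepURLs in that order.
line_counts = {"discardURLs": 23794, "greylistedURLs": 4485, "keepURLs": 47280}
domain_counts = {"discardURLs": 1588, "greylistedURLs": 288, "keepURLs": 1462}
unique_basic_domain_counts = {"discardURLs": 1415, "greylistedURLs": 277, "keepURLs": 1362}

# Totals stated in the notes for each kind of count.
stated_totals = {
    "lines": 75559,
    "domains": 3338,
    "unique basic domains": 3054,
}

computed_totals = {
    "lines": sum(line_counts.values()),
    "domains": sum(domain_counts.values()),
    "unique basic domains": sum(unique_basic_domain_counts.values()),
}

for label, stated in stated_totals.items():
    status = "ok" if computed_totals[label] == stated else "MISMATCH"
    print(f"{label}: computed {computed_totals[label]}, stated {stated} -> {status}")
```

All three computed totals agree with the stated ones. The sums are independent of how the counts are assigned to the lists, so the ordering assumption affects only the labels.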
     
    121 123  2a. DISCARDED URLS:
    122 124  URLS that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
        125
        126  > wc -l discardURLs.txt
    123 127  23794
    124 128
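For readers without `wc` available, the line count recorded above could be reproduced with a small Python helper. This is a minimal sketch, and `count_lines` is a hypothetical name that is not part of the project. Like `wc -l`, it counts newline characters, so a final line without a trailing newline is not counted.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file, mirroring what `wc -l` reports."""
    total = 0
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large URL lists need not fit in memory.
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Calling `count_lines("discardURLs.txt")` on the same file should then give the figure that `wc -l` reported.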
     
    147 151
    148 152
    149      e. After duplicates further pruned out from waht remained of keepURLs - the seedURLs for Nutch:
        153  e. After duplicates further pruned out from what remained of keepURLs - the seedURLs for Nutch:
    150 154
    151 155  wc -l seedURLs.txt