Changeset 34001
- Timestamp: 2020-03-09T18:56:00+13:00
- Files: 1 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
r33999 r34001
   38  38  - 1.1 billion URLs not contained in any crawl archive before
   39  39
+      40  = 9100 million or 9.1 billion new URLs not contained in any crawl archive before
+      41  taking the first crawl month's figure of 2.8 billion - 500 million new URL in first crawl month = 11.4 billion URLs? At least?
   40  42  ---------------------------------------------
   41  43
… …
   70  72  [X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
   71  73
-  72      Line count above correct with the following: 23794+4485+47280=75559
-  73
-  74      But instead of domain/unique domain /URL/basic unique URL counts. The union of:
+      74  Line count above is correct and consistent with the following: 23794+4485+47280=75559
+      75
+      76  But instead of domain/unique domain or URL/basic unique URL counts. The union of:
   75  77  - domains of the following: 1588+288+1462 = 3338
   76  78  - unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
… …
  121 123  2a. DISCARDED URLS:
  122 124  URLS that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
+     125
+     126  > wc -l discardURLs.txt
  123 127  23794
  124 128
… …
  147 151
  148 152
- 149      e. After duplicates further pruned out from w aht remained of keepURLs - the seedURLs for Nutch:
+     153  e. After duplicates further pruned out from what remained of keepURLs - the seedURLs for Nutch:
  150 154
  151 155  wc -l seedURLs.txt
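
The totals quoted in the changed lines can be re-derived with a few lines of Python. This is only a sanity-check sketch, not part of the changeset. The variable names are made up for illustration, and the figures are copied from the notes. Two things are assumptions: that the three line counts follow the same discard/greylisted/keep order as the domain counts (only 23794 is labelled explicitly, by the `wc -l discardURLs.txt` line), and that the URL estimate is meant as 9.1 billion + 2.8 billion - 500 million.

```python
# Figures copied from piechart_data.txt as shown in the diff.
# Assumed order: (discardURLs, greylistedURLs, keepURLs).
line_counts = (23794, 4485, 47280)
domain_counts = (1588, 288, 1462)
basic_domain_counts = (1415, 277, 1362)  # stripped of protocol and www

total_lines = sum(line_counts)
total_domains = sum(domain_counts)
total_basic_domains = sum(basic_domain_counts)

# URL estimate in millions, assuming the note means:
# 9100 (new URLs) + 2800 (first crawl month) - 500 (new in first crawl month).
url_estimate_millions = 9100 + 2800 - 500

print(total_lines, total_domains, total_basic_domains, url_estimate_millions)
```

Under those assumptions the sums agree with the figures in the notes: 75559 lines, 3338 domains, 3054 unique basic domains, and 11400 million (11.4 billion) URLs.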