source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 34001

Last change on this file since 34001 was 34001, checked in by ak19, 4 years ago

Tentative total URLs from the CommonCrawl 12-month crawl data.

The 12-month period of CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

= 9100 million, i.e. 9.1 billion, new URLs not contained in any crawl archive before
+ the first crawl month's total of 2.8 billion minus its 500 million new URLs (2.3 billion pre-existing URLs) = at least 11.4 billion URLs?
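A quick arithmetic cross-check of the totals above (monthly new-URL figures transcribed from the announcements, in millions):

```python
# Cross-check of the monthly CommonCrawl figures listed above.
# "New URLs" per month, in millions, Sept 2018 - Aug 2019:
new_urls_millions = [500, 600, 640, 735, 850, 750, 660, 750, 825, 880, 810, 1100]

total_new = sum(new_urls_millions)  # 9100 million = 9.1 billion

# The first month's crawl also contained URLs that were not new:
# its 2.8 billion pages minus its 500 million new URLs.
first_month_old = 2800 - 500  # 2300 million

lower_bound = total_new + first_month_old  # 11400 million = 11.4 billion
print(total_new, lower_bound)  # 9100 11400
```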
---------------------------------------------

"UPPER BOUND"

blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted

Not included: for sites otherwise too big to crawl exhaustively, only the areas of interest were crawled, not the rest. For example, not all of wikipedia, only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by the common crawl results for MRI; not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.

1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
 Count of unique domains: 3074
 Count of unique basic domains (stripped of protocol and www): 2791
 Line count: 75559
 Actual unique URL count: 38717
 Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

Line count above is correct and consistent with the following: 23794+4485+47280=75559

But note that the per-file sums differ from the union (unique) counts above. Summing the three files' domain/unique-domain and URL/unique-URL counts instead gives:
- domains, summed across the three files: 1588+288+1462 = 3338
- unique basic domains (stripped of protocol and www), summed: 1415+277+1362 = 3054
- basic URL count = 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
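The union counts above (e.g. 3074 unique domains) come out smaller than the per-file sums (3338) because a domain occurring in more than one of the three files is counted once in the union but once per file in the sum. A toy illustration with made-up domains, not the real data:

```python
# Why the union count is smaller than the sum of per-file counts:
# items occurring in more than one file are deduplicated by the union
# but counted once per file by the sum. Hypothetical example data.
discard = {"a.nz", "b.nz", "c.nz"}
greylist = {"b.nz", "d.nz"}
keep = {"c.nz", "d.nz", "e.nz"}

sum_of_counts = len(discard) + len(greylist) + len(keep)  # 8
union_count = len(discard | greylist | keep)              # 5 (b.nz, c.nz, d.nz shared)
print(sum_of_counts, union_count)  # 8 5
```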

wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
 Count of unique domains: 1588
 Count of unique basic domains (stripped of protocol and www): 1415
 Line count: 23794
 Actual unique URL count: 10290
 Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
 Count of unique domains: 288
 Count of unique basic domains (stripped of protocol and www): 277
 Line count: 4485
 Actual unique URL count: 2751
 Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
 Count of unique domains: 1464
 Count of unique basic domains (stripped of protocol and www): 1362
 Line count: 47280
 Actual unique URL count: 25683
 Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
 Count of unique domains: 1462
 Count of unique basic domains (stripped of protocol and www): 1360
 Line count: 25679
 Actual unique URL count: 25679
 Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.
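A sketch of how the subset claim can be checked; the read_urls helper and the inline toy data are illustrative, assuming the files hold one URL per line:

```python
# Hypothetical check that every seed URL also appears in keepURLs,
# assuming both files contain one URL per line.
def read_urls(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# seeds = read_urls("seedURLs.txt")
# keeps = read_urls("keepURLs.txt")
# print(seeds <= keeps)  # True if seedURLs really is a subset of keepURLs

# Inline demonstration with toy data instead of the real files:
seeds = {"https://x.nz/1", "https://x.nz/2"}
keeps = seeds | {"https://y.nz/3"}
print(seeds <= keeps)  # True
```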

2a. DISCARDED URLS:
URLs that are blacklisted + pages with too little text content (below an arbitrary minimum threshold)

> wc -l discardURLs.txt
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt

d. Of the keepURLs, 4 more webpages turned out to be from ultimately irrelevant sites, listed in unprocessed-topsite-matches.txt.

3 are from the same domain but not in MRI; one is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
    The following domain with seedURLs are on a major/top 500 site
    for which no allowed URL pattern regex has been specified.
    Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
    http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
    http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
    http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
    https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl

e. After duplicates were further pruned from what remained of keepURLs, these are the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
 Count of domains: 1462
 Count of unique domains: 1360


But anglican.org was wrongly greylisted and so was added back in
-> 1463 domains.
3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):

wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
 1 1463 10241

(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463

/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463

[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]
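As a cross-check of the directory-counting one-liners above (`echo */ | wc`, whose 2nd number is the directory count, and `find -mindepth 1 -maxdepth 1 -type d | wc -l`), a Python sketch; the throwaway temporary tree here stands in for the real sites/ folder:

```python
# Count immediate subdirectories, mirroring:
#   find . -mindepth 1 -maxdepth 1 -type d | wc -l
import os
import tempfile

def count_subdirs(path):
    return sum(1 for entry in os.scandir(path) if entry.is_dir())

# Demonstration on a throwaway directory tree, not the real sites/ folder:
with tempfile.TemporaryDirectory() as root:
    for i in range(5):
        os.mkdir(os.path.join(root, "%05d" % (i + 1)))  # 00001..00005
    open(os.path.join(root, "notes.txt"), "w").close()  # plain files not counted
    print(count_subdirs(root))  # 5
```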

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
 1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619


6. Number of sites in MongoDB:
1446

Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)

* 01067 is listed under sites crawled, but not ingested into mongodb.

The siteID ranges of the 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in MongoDB:
99, 88, 97 and 99 respectively,

and 64/64 sites in siteIDs 1400-1463.

=> 1+12+3+1 = 17 of the 1463 sites are not in MongoDB => 1446 sites ingested into MongoDB.
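The 17-site tally can be recomputed from the missing siteID ranges listed above:

```python
# Site IDs noted above as absent from MongoDB:
# 00179, 00485-00495, 00499-00502 and 01067.
missing = {179} | set(range(485, 496)) | set(range(499, 503)) | {1067}

# Missing IDs per hundreds-range: 00100s, 00400s, 00500s, 01000s.
per_range = {h: sum(1 for m in missing if m // 100 == h) for h in (1, 4, 5, 10)}

print(len(missing))         # 17
print(per_range)            # {1: 1, 4: 12, 5: 3, 10: 1} -> 99, 88, 97, 99 present
print(1463 - len(missing))  # 1446 sites ingested
```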


7. Despite the 25679 non-duplicate seedURLs, and many more pages crawled from those seeds,
the number of web pages ingested into MongoDB is less than about 5 times that figure,
because only crawled web pages with non-empty text were ingested into MongoDB.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874
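A quick check of the less-than-five-times figure:

```python
# Ratio of ingested pages to seed URLs: pages with empty text were dropped,
# so the multiplier stays under 5.
pages_in_mongodb = 119874
seed_urls = 25679
print(round(pages_in_mongodb / seed_urls, 2))  # 4.67
```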

---------------------------

# Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
 3276 9828 419259

# Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
 1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
 1027 4108 35945

# Number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
 1446 1446 24582
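The first fgrep pipeline above can be mirrored in Python as a sketch; the count_notfound helper and its glob pattern are illustrative, assuming dump.txt files with one Nutch field per line:

```python
# Count lines mentioning both NOTFOUND and protocolStatus across dump.txt
# files, mirroring: fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc -l
import glob

def count_notfound(pattern="0*/dump.txt"):
    total = 0
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:  # like -a: tolerate binary bytes
            total += sum(1 for line in f
                         if "NOTFOUND" in line and "protocolStatus" in line)
    return total

# Toy demonstration instead of the real crawl output:
lines = ["protocolStatus: NOTFOUND", "text:start", "protocolStatus: SUCCESS"]
print(sum(1 for l in lines if "NOTFOUND" in l and "protocolStatus" in l))  # 1
```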

Look to see if CommonCrawl has a field for how much text there is on a page.
If not, this would be a useful feature for them to add.