source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 33999

Last change on this file since 33999 was 33999, checked in by ak19, 4 years ago

Common crawl 12 month urls and CC provided stats

File size: 10.2 KB
The 12-month period of CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

---------------------------------------------

"UPPER BOUND"

blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted


Not included: for sites otherwise too big to exhaustively crawl, only the areas of interest were crawled, not the rest. For example, not all of wikipedia but only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by the CommonCrawl results for MRI; not all of docs.google.com, only the specific pages that turned up in CommonCrawl for MRI.

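Restricting a crawl to just the area of interest of a big site comes down to a per-site allowed-URL pattern (Nutch expresses these as regex rules; the tab-spaced sites-too-big-to-exhaustively-crawl.txt file mentioned later serves this role). The grep sketch below only illustrates the idea of such a pattern; it is not the project's actual configuration:

```shell
# Keep only mi.wikipedia.org URLs; every other wikipedia.org URL is dropped.
# The pattern and example URLs are illustrative, not the project's real filter.
printf '%s\n' \
  'https://mi.wikipedia.org/wiki/Aotearoa' \
  'https://en.wikipedia.org/wiki/New_Zealand' \
  'https://mi.wikipedia.org/wiki/Reo' \
  | grep -E '^https?://mi\.wikipedia\.org/'
```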
1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
   Count of unique domains: 3074
   Count of unique basic domains (stripped of protocol and www): 2791
   Line count: 75559
   Actual unique URL count: 38717
   Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

The line count above matches the sum of the per-file line counts: 23794 + 4485 + 47280 = 75559.

But the domain / unique-domain / URL / unique-URL figures above are counts over the union of the three files, not sums. Summing the per-file figures instead gives:
- domains: 1588 + 288 + 1462 = 3338
- unique basic domains (stripped of protocol and www): 1415 + 277 + 1362 = 3054
- unique URL count: 10290 + 2751 + 25683 = 38724
- unique basic URL count (stripped of protocol and www): 9656 + 2727 + 20451 = 32834
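The arithmetic can be checked mechanically (all figures are copied from the per-file AllDomainCount runs):

```shell
# Line counts of the three files sum exactly to the combined run's line count.
[ $((23794 + 4485 + 47280)) -eq 75559 ] && echo "line counts: ok"
# Domain and URL sums exceed the union counts, because the same domain or URL
# can appear in more than one of the three files.
echo "domains:           sum=$((1588 + 288 + 1462))  union=3074"
echo "unique basic doms: sum=$((1415 + 277 + 1362))  union=2791"
echo "unique URLs:       sum=$((10290 + 2751 + 25683))  union=38717"
```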


wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
   Count of unique domains: 1588
   Count of unique basic domains (stripped of protocol and www): 1415
   Line count: 23794
   Actual unique URL count: 10290
   Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
   Count of unique domains: 288
   Count of unique basic domains (stripped of protocol and www): 277
   Line count: 4485
   Actual unique URL count: 2751
   Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
   Count of unique domains: 1464
   Count of unique basic domains (stripped of protocol and www): 1362
   Line count: 47280
   Actual unique URL count: 25683
   Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
   Count of unique domains: 1462
   Count of unique basic domains (stripped of protocol and www): 1360
   Line count: 25679
   Actual unique URL count: 25679
   Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.
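The "basic" counts strip the protocol and a leading www from each URL. Presumably the normalisation amounts to something like the sed pipeline below (a sketch; the actual logic lives in the AllDomainCount Java code):

```shell
# Strip the protocol and a leading "www." so that variants of the same URL
# collapse together, then count the distinct results.
normalise() { sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|^www\.||'; }

# All three variants below normalise to "example.com/page".
printf '%s\n' \
  'https://www.example.com/page' \
  'http://example.com/page' \
  'example.com/page' \
  | normalise | sort -u | wc -l
```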


2a. DISCARDED URLS:
URLs that are blacklisted + pages with too little text content (under an arbitrary minimum threshold):
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt


d. Of the keepURLs, 4 more webpages turned out to be on ultimately irrelevant sites; they are listed in unprocessed-topsite-matches.txt.

Three are not in MRI but are from the same domain; the other is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
   The following domain with seedURLs are on a major/top 500 site
   for which no allowed URL pattern regex has been specified.
   Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
   http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
   http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
   http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
   https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl


e. After duplicates were further pruned out from what remained of keepURLs - the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
   Count of domains: 1462
   Count of unique domains: 1360
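The drop from 47280 keepURLs lines to 25679 seedURLs is essentially duplicate removal (plus the few topsite exclusions above). The core of such a pruning step is a sort -u, sketched here on toy URLs; the project's real pipeline may differ:

```shell
# Collapsing repeated URLs: three input lines, two distinct URLs remain.
# The URLs are made up for illustration.
printf '%s\n' \
  'http://a.example.nz/x' \
  'http://b.example.nz/y' \
  'http://a.example.nz/x' \
  | sort -u | wc -l
```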


But anglican.org was wrongly greylisted, so it was added back in
-> 1463 domains.

3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):

wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
      1    1463   10241

(The 2nd number is the directory count.)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463

/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders numbered up to 01463.

[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]
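Both commands count only directories: the shell glob */ matches subdirectories only (so wc's 2nd field, the word count, is the directory count), and the find variant prints one line per directory. A throwaway demonstration:

```shell
# Scratch tree with 3 subdirectories and 1 plain file; the file is not counted.
tmp=$(mktemp -d)
mkdir "$tmp/00001" "$tmp/00002" "$tmp/00003"
touch "$tmp/notadir.txt"
cd "$tmp"
echo */ | wc                                      # 2nd number (words) = 3
find . -mindepth 1 -maxdepth 1 -type d | wc -l    # 3
cd / && rm -rf "$tmp"
```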

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
      1    1447   10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619

6. Number of sites in MongoDB:
1446

Not: 00179, 00485-00495, 00499-00502, 01067* (no dump.txt, but the website is repeated in 01408)

* 01067 is listed under sites crawled, but was not ingested into MongoDB.

In the siteID ranges of the 00100s, 00400s, 00500s and 01000s, each has fewer than 100 sites in MongoDB:
99, 88, 97, 99

and 64/64 sites in siteIDs 01400-01463.

=> 1+12+3+1 = 17 of the 1463 sites are missing from MongoDB => 1463 - 17 = 1446 sites ingested into MongoDB.

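The per-range shortfalls tally with the final count:

```shell
# One site short in the 00100s, 12 in the 00400s, 3 in the 00500s, 1 in the 01000s.
missing=$(( (100-99) + (100-88) + (100-97) + (100-99) ))
echo "sites missing from MongoDB: $missing"    # 17
echo "sites ingested: $(( 1463 - missing ))"   # 1446
```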

7. Despite 25679 non-duplicate seedURLs, and many more pages crawled from those seeds, the number of web pages ingested into MongoDB is less than about 5 times that, because only crawled web pages with non-empty text were ingested into MongoDB.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874

---------------------------

# Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
   3276    9828  419259

# Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
   1027    1027   15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
   1027    4108   35945

# Number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
   1446    1446   24582
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>


Look to see whether CommonCrawl has a field for how much text there is on a page. If not, this would be a useful feature for them to add.
