source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 33999

Last change on this file since 33999 was 33999, checked in by ak19, 4 years ago

Common crawl 12 month urls and CC provided stats

File size: 10.2 KB
The 12-month period of CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

---------------------------------------------

"UPPER BOUND"

blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted


Not included: for sites otherwise too big to exhaustively crawl, only the areas of interest were crawled, not the rest. For example, not all of wikipedia but only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by the CommonCrawl results for MRI; not all of docs.google.com, only the specific pages that turned up in CommonCrawl for MRI.

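Restricting a crawl to just the area of interest of a big site comes down to a per-site allowed-URL pattern (Nutch expresses these as regex rules; the tab-spaced sites-too-big-to-exhaustively-crawl.txt file mentioned later serves this role). The grep sketch below only illustrates the idea of such a pattern; it is not the project's actual configuration:

```shell
# Keep only mi.wikipedia.org URLs; every other wikipedia.org URL is dropped.
# The pattern and example URLs are illustrative, not the project's real filter.
printf '%s\n' \
  'https://mi.wikipedia.org/wiki/Aotearoa' \
  'https://en.wikipedia.org/wiki/New_Zealand' \
  'https://mi.wikipedia.org/wiki/Reo' \
  | grep -E '^https?://mi\.wikipedia\.org/'
```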
1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
   Count of unique domains: 3074
   Count of unique basic domains (stripped of protocol and www): 2791
   Line count: 75559
   Actual unique URL count: 38717
   Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

The line count above matches the sum of the per-file line counts: 23794 + 4485 + 47280 = 75559.

But the domain / unique-domain / URL / unique-URL figures above are counts over the union of the three files, not sums. Summing the per-file figures instead gives:
- domains: 1588 + 288 + 1462 = 3338
- unique basic domains (stripped of protocol and www): 1415 + 277 + 1362 = 3054
- unique URL count: 10290 + 2751 + 25683 = 38724
- unique basic URL count (stripped of protocol and www): 9656 + 2727 + 20451 = 32834
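The arithmetic can be checked mechanically (all figures are copied from the per-file AllDomainCount runs):

```shell
# Line counts of the three files sum exactly to the combined run's line count.
[ $((23794 + 4485 + 47280)) -eq 75559 ] && echo "line counts: ok"
# Domain and URL sums exceed the union counts, because the same domain or URL
# can appear in more than one of the three files.
echo "domains:           sum=$((1588 + 288 + 1462))  union=3074"
echo "unique basic doms: sum=$((1415 + 277 + 1362))  union=2791"
echo "unique URLs:       sum=$((10290 + 2751 + 25683))  union=38717"
```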


wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
   Count of unique domains: 1588
   Count of unique basic domains (stripped of protocol and www): 1415
   Line count: 23794
   Actual unique URL count: 10290
   Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
   Count of unique domains: 288
   Count of unique basic domains (stripped of protocol and www): 277
   Line count: 4485
   Actual unique URL count: 2751
   Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
   Count of unique domains: 1464
   Count of unique basic domains (stripped of protocol and www): 1362
   Line count: 47280
   Actual unique URL count: 25683
   Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
   Count of unique domains: 1462
   Count of unique basic domains (stripped of protocol and www): 1360
   Line count: 25679
   Actual unique URL count: 25679
   Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.
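The "basic" counts strip the protocol and a leading www from each URL. Presumably the normalisation amounts to something like the sed pipeline below (a sketch; the actual logic lives in the AllDomainCount Java code):

```shell
# Strip the protocol and a leading "www." so that variants of the same URL
# collapse together, then count the distinct results.
normalise() { sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|^www\.||'; }

# All three variants below normalise to "example.com/page".
printf '%s\n' \
  'https://www.example.com/page' \
  'http://example.com/page' \
  'example.com/page' \
  | normalise | sort -u | wc -l
```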


2a. DISCARDED URLS:
URLs that are blacklisted + pages with too little text content (under an arbitrary minimum threshold):
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt


d. Of the keepURLs, 4 more webpages turned out to be on ultimately irrelevant sites; they are listed in unprocessed-topsite-matches.txt.

Three are not in MRI but are from the same domain; the other is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
   The following domain with seedURLs are on a major/top 500 site
   for which no allowed URL pattern regex has been specified.
   Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
   http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
   http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
   http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
   https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl


e. After duplicates were further pruned out from what remained of keepURLs - the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
   Count of domains: 1462
   Count of unique domains: 1360
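The drop from 47280 keepURLs lines to 25679 seedURLs is essentially duplicate removal (plus the few topsite exclusions above). The core of such a pruning step is a sort -u, sketched here on toy URLs; the project's real pipeline may differ:

```shell
# Collapsing repeated URLs: three input lines, two distinct URLs remain.
# The URLs are made up for illustration.
printf '%s\n' \
  'http://a.example.nz/x' \
  'http://b.example.nz/y' \
  'http://a.example.nz/x' \
  | sort -u | wc -l
```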


But anglican.org was wrongly greylisted, so it was added back in
-> 1463 domains.

3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):

wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
      1    1463   10241

(The 2nd number is the directory count.)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463

/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders numbered up to 01463.

[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]
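Both commands count only directories: the shell glob */ matches subdirectories only (so wc's 2nd field, the word count, is the directory count), and the find variant prints one line per directory. A throwaway demonstration:

```shell
# Scratch tree with 3 subdirectories and 1 plain file; the file is not counted.
tmp=$(mktemp -d)
mkdir "$tmp/00001" "$tmp/00002" "$tmp/00003"
touch "$tmp/notadir.txt"
cd "$tmp"
echo */ | wc                                      # 2nd number (words) = 3
find . -mindepth 1 -maxdepth 1 -type d | wc -l    # 3
cd / && rm -rf "$tmp"
```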

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
      1    1447   10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619

6. Number of sites in MongoDB:
1446

Not: 00179, 00485-00495, 00499-00502, 01067* (no dump.txt, but the website is repeated in 01408)

* 01067 is listed under sites crawled, but was not ingested into MongoDB.

In the siteID ranges of the 00100s, 00400s, 00500s and 01000s, each has fewer than 100 sites in MongoDB:
99, 88, 97, 99

and 64/64 sites in siteIDs 01400-01463.

=> 1+12+3+1 = 17 of the 1463 sites are missing from MongoDB => 1463 - 17 = 1446 sites ingested into MongoDB.

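The per-range shortfalls tally with the final count:

```shell
# One site short in the 00100s, 12 in the 00400s, 3 in the 00500s, 1 in the 01000s.
missing=$(( (100-99) + (100-88) + (100-97) + (100-99) ))
echo "sites missing from MongoDB: $missing"    # 17
echo "sites ingested: $(( 1463 - missing ))"   # 1446
```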

7. Despite 25679 non-duplicate seedURLs, and many more pages crawled from those seeds, the number of web pages ingested into MongoDB is less than about 5 times that, because only crawled web pages with non-empty text were ingested into MongoDB.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874

---------------------------

# Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
   3276    9828  419259

# Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
   1027    1027   15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
   1027    4108   35945

# Number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
   1446    1446   24582
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>


Look to see whether CommonCrawl has a field for how much text there is on a page. If not, this would be a useful feature for them to add.
