source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 33985

Last change on this file since 33985 was 33985, checked in by ak19, 4 years ago
Data to back the piechart I need to make that will illustrate how we continuously filtered out the pool of sites and urls returned by commoncrawl for MRI text down to the final web domains and pages we worked with for our samples.
File size: 6.9 KB

1	blacklisted
2	greylisted
3	skipped crawling
4	unfinished (crawling)
5
6	Sites crawled and ingested into mongodb:
7	- domains shortlisted
8	- not shortlisted
9
10
11	Not included: for sites too big to crawl exhaustively, only the areas of interest were crawled, not the rest. For example, only mi.wikipedia.org rather than all of wikipedia; only the blogspot blogs indicated by common crawl results for MRI rather than all of blogspot; and only the specific docs.google.com pages that turned up in common crawl for MRI.
12
13
14	1. ALL DOMAINS FROM CC-CRAWL:
15
16	Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)
17
18	wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
19	Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
20	Count of unique domains: 3074
21	Count of unique basic domains (stripped of protocol and www): 2791
22	Line count: 75559
23	Actual unique URL count: 38717
24	Unique basic URL count (stripped of protocol and www): 32827
25	******************************************************
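The difference between "unique domains" and "unique basic domains" in the output above can be approximated in the shell. This is a minimal sketch only, not the AllDomainCount implementation (which is not shown here). It assumes one URL per line; the function names `basic_domains` and `count_basic_domains` and the example.org sample URLs are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read URLs on stdin, print each URL's "basic" domain,
# i.e. the host with the protocol and a leading "www." removed.
basic_domains() {
    sed -E -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' \
           -e 's|^www\.||' \
           -e 's|[/?#].*$||'
}

# Hypothetical helper: count the distinct basic domains on stdin.
# tr strips the padding that some wc implementations add.
count_basic_domains() {
    basic_domains | sort -u | wc -l | tr -d ' '
}

# Sample input (illustrative URLs): the first two collapse to example.org.
printf '%s\n' \
    'http://www.example.org/a' \
    'https://example.org/b?x=1' \
    'http://mi.wikipedia.org/wiki/X' | count_basic_domains   # prints 2
```

Against a real file the call would look like `count_basic_domains < keepURLs.txt`. The result may differ from the Java tool's figure if that tool normalises URLs differently, for example by lowercasing hosts or dropping ports.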
26
27	[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
28
29	The line count above agrees with the sum of the per-file line counts: 23794+4485+47280=75559
30
31	The domain and URL counts above, however, are lower than the sums of the per-file counts below, since the combined run counts each domain or URL once even if it appears in more than one file:
32	- domains: 1588+288+1462 = 3338
33	- unique basic domains (stripped of protocol and www): 1415+277+1362 = 3054
34	- unique URLs: 10290 + 2751 + 25683 = 38724
35	- unique basic URLs (stripped of protocol and www): 9656 + 2727 + 20451 = 32834
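The gap between these per-file sums and the figures from the combined run can be worked out directly. The sketch below uses only the numbers quoted above; the variable names are illustrative. The gap is presumably the number of domains or URLs that occur in more than one of the three files.

```shell
#!/bin/sh
# Per-file sums as listed above (discardURLs + greyListed + keepURLs).
sum_domains=$((1588 + 288 + 1462))
sum_basic_domains=$((1415 + 277 + 1362))
sum_urls=$((10290 + 2751 + 25683))
sum_basic_urls=$((9656 + 2727 + 20451))

# Difference from the counts reported by the combined run over all three files.
overlap_domains=$((sum_domains - 3074))
overlap_basic_domains=$((sum_basic_domains - 2791))
overlap_urls=$((sum_urls - 38717))
overlap_basic_urls=$((sum_basic_urls - 32827))

echo "domains:       sum=$sum_domains overlap=$overlap_domains"
echo "basic domains: sum=$sum_basic_domains overlap=$overlap_basic_domains"
echo "URLs:          sum=$sum_urls overlap=$overlap_urls"
echo "basic URLs:    sum=$sum_basic_urls overlap=$overlap_basic_urls"
```

On these figures the domain counts differ by 264 and 263, while the URL counts differ by only 7 each.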
36
37
38	wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
39	Counting all domains and urls in discardURLs.txt
40	Count of unique domains: 1588
41	Count of unique basic domains (stripped of protocol and www): 1415
42	Line count: 23794
43	Actual unique URL count: 10290
44	Unique basic URL count (stripped of protocol and www): 9656
45	******************************************************
46	wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
47	Counting all domains and urls in greyListed.txt
48	Count of unique domains: 288
49	Count of unique basic domains (stripped of protocol and www): 277
50	Line count: 4485
51	Actual unique URL count: 2751
52	Unique basic URL count (stripped of protocol and www): 2727
53	******************************************************
54	wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
55	Counting all domains and urls in keepURLs.txt
56	Count of unique domains: 1464
57	Count of unique basic domains (stripped of protocol and www): 1362
58	Line count: 47280
59	Actual unique URL count: 25683
60	Unique basic URL count (stripped of protocol and www): 20451
61	******************************************************
62
63
64	XXXXXXXXXX
65	wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
66	Counting all domains and urls in seedURLs.txt
67	Count of unique domains: 1462
68	Count of unique basic domains (stripped of protocol and www): 1360
69	Line count: 25679
70	Actual unique URL count: 25679
71	Unique basic URL count (stripped of protocol and www): 20447
72	******************************************************
73	XXXXXXXXXX
74
75	seedURLs is a subset of keepURLs.
76
77
78	2a. DISCARDED URLS:
79	URLs that are blacklisted, plus pages with too little text content (under an arbitrary minimum threshold)
80	23794
81
82	b. GREYLISTED URLS:
83	> wc -l greyListed.txt
84	4485
85
86
87	c. keepURLs (the URLs we kept for further processing):
88	wc -l keepURLs.txt
89	47280 keepURLs.txt
90
91
92	d. Of the keepURLs, 4 more web pages, from ultimately irrelevant sites, are listed in unprocessed-topsite-matches.txt.
93
94	3 are not in MRI but belong to the same domain; the fourth is just a gallery of holiday pictures.
95
96	> less unprocessed-topsite-matches.txt
97	The following domain with seedURLs are on a major/top 500 site
98	for which no allowed URL pattern regex has been specified.
99	Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
100	http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
101	http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
102	http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
103	https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl
104
105
106	e. After duplicates were further pruned from what remained of keepURLs, these became the seedURLs for Nutch:
107
108	wc -l seedURLs.txt
109	25679 seedURLs.txt
110
111	wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
112	In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
113	Count of domains: 1462
114	Count of unique domains: 1360
115
116
117	But anglican.org had been wrongly greylisted, so it was added back in
118	-> 1463 domains.
119
120	3a. Num URLs prepared for crawling:
121	wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
122	25679 seedURLs.txt
123
124	b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
125
126
127	wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
128	1 1463 10241
129
130	(2nd number)
131	OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
132	1463
133
134
135	/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463
136
137
138	[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
139	1462+1 (for the greylisted anglican.org) = 1463]
140
141	4. Num sites crawled:
142	wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
143	1447
144	wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
145	1 1447 10129
146
147	5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
148	wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
149	619
150
151
152	6. Number of sites in MongoDB:
153	1446
154
155	Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)
156
157	* 01067 is listed under sites crawled, but not ingested into mongodb.
158
159	The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
160	99, 88, 97, 99
161
162	and 64/64 sites in siteIDs 1400-1463.
163
164	=> 1+12+3+1 = 17 of the 1463 sites are not in mongodb (16 not crawled, plus 01067 which was crawled but not ingested), leaving 1446 sites ingested in mongodb.
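The arithmetic above can be cross-checked by enumerating the missing siteIDs. The sketch below works from the ID ranges listed in these notes; the helper names `missing_ids` and `missing_in_block` are illustrative, and `seq` is assumed to be available (it is common but not part of POSIX).

```shell
#!/bin/sh
# siteIDs noted as absent from mongodb: 00179, 00485-00495, 00499-00502, 01067.
missing_ids() {
    { echo 179; seq 485 495; seq 499 502; echo 1067; }
}

# Count the missing IDs in one block of a hundred, e.g. "400" for the 00400s.
missing_in_block() {
    missing_ids | awk -v lo="$1" '$1 >= lo && $1 < lo + 100' | wc -l | tr -d ' '
}

missing_total=$(missing_ids | wc -l | tr -d ' ')
ingested=$((1463 - missing_total))

for block in 100 400 500 1000; do
    echo "block $block: $((100 - $(missing_in_block "$block"))) sites in mongodb"
done
echo "missing: $missing_total, ingested: $ingested"
```

This reproduces the per-block figures 99, 88, 97 and 99, with 17 sites missing and 1446 ingested.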
165
166
167	7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
168	the number of web pages ingested into mongodb is a little under 5 times the number of seedURLs,
169	because only crawled web pages with non-empty text were ingested into mongodb.
170
171	Num pages in MongoDB:
172	db.getCollection('Webpages').find({}).count()
173	119874
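For reference, the ratio of ingested pages to seedURLs can be computed from the two counts quoted above. This is a small sketch with illustrative variable names; `LC_ALL=C` is set only so that the decimal separator is a dot.

```shell
#!/bin/sh
seed_urls=25679      # wc -l seedURLs.txt
mongo_pages=119874   # db.getCollection('Webpages').find({}).count()

# Pages in mongodb per seed URL, rounded to two decimals.
ratio=$(LC_ALL=C awk -v p="$mongo_pages" -v s="$seed_urls" \
    'BEGIN { printf "%.2f", p / s }')
echo "pages per seed URL: $ratio"   # prints "pages per seed URL: 4.67"
```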
174
175
