source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 33986

Last change on this file since 33986 was 33986, checked in by ak19, 4 years ago
Dr Bainbridge investigated the original data set more
File size: 7.8 KB

1	"UPPER BOUND"
2
3	blacklisted
4	greylisted
5	skipped crawling
6	unfinished (crawling)
7
8	Sites crawled and ingested into mongodb:
9	- domains shortlisted
10	- not shortlisted
11
12
13	Not included: for sites otherwise too big to crawl exhaustively, only the areas of interest were crawled, not the rest. For example, not all of wikipedia but only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by common crawl results for MRI; not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.
14
15
16	1. ALL DOMAINS FROM CC-CRAWL:
17
18	Total counts from CommonCrawl (i.e. unique domain count across discardURLs.txt + greyListed.txt + keepURLs.txt)
19
20	wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
21	Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
22	Count of unique domains: 3074
23	Count of unique basic domains (stripped of protocol and www): 2791
24	Line count: 75559
25	Actual unique URL count: 38717
26	Unique basic URL count (stripped of protocol and www): 32827
27	******************************************************
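The difference between "unique domains" and "unique basic domains" in the output above can be sketched as follows. This is only an illustration, not the AllDomainCount source: the function names and sample URLs are hypothetical, and it assumes that a "domain" keeps its protocol while a "basic domain" is the host with the protocol and a leading www. removed.

```python
from urllib.parse import urlsplit

def domain_with_protocol(url):
    """Return scheme://host (assumed meaning of 'domain' in the counts)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"

def basic_domain(url):
    """Return the host without protocol or leading 'www.' (assumed meaning of 'basic domain')."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def count_domains(urls):
    """Count unique domains and unique basic domains over a list of URLs."""
    domains = {domain_with_protocol(u) for u in urls}
    basics = {basic_domain(u) for u in urls}
    return len(domains), len(basics)

# Hypothetical sample: three spellings of one site plus one other site.
sample = [
    "http://www.example.org/a",
    "https://www.example.org/b",
    "https://example.org/c",
    "http://mi.wikipedia.org/wiki/X",
]
unique_domains, unique_basic = count_domains(sample)
print(unique_domains, unique_basic)
```

Under these assumptions the basic count can never exceed the plain domain count, which matches the direction of the figures reported above (2791 versus 3074).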
28
29	[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
30
31	The line count above agrees with the sum of the per-file line counts: 23794+4485+47280=75559
32
33	The domain/unique domain/URL/basic unique URL counts above, however, are counted over the union of the three files, so they come out lower than the per-file sums:
34	- domains of the following: 1588+288+1462 = 3338
35	- unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
36	- basic URL count = 10290 + 2751 + 25683 = 38724
37	- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
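The sums above can be checked against the combined run with a few lines of arithmetic. All figures are copied from this file; reading the surplus as entries that occur in more than one of the three lists is an assumption, not something the tool reports.

```python
# Per-file counts in the order (discardURLs, greyListed, keepURLs), as used in the sums above.
per_file = {
    "domains": (1588, 288, 1462),
    "basic domains": (1415, 277, 1362),
    "unique URLs": (10290, 2751, 25683),
    "basic unique URLs": (9656, 2727, 20451),
}

# Counts reported by the combined AllDomainCount run over all three files.
combined = {
    "domains": 3074,
    "basic domains": 2791,
    "unique URLs": 38717,
    "basic unique URLs": 32827,
}

# Line counts add up exactly because every line is counted, duplicates included.
line_total = 23794 + 4485 + 47280

# Surplus of each per-file sum over the combined count.
surplus = {name: sum(counts) - combined[name] for name, counts in per_file.items()}
for name, extra in surplus.items():
    print(f"{name}: sum {sum(per_file[name])}, combined {combined[name]}, surplus {extra}")
```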
38
39
40	wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
41	Counting all domains and urls in discardURLs.txt
42	Count of unique domains: 1588
43	Count of unique basic domains (stripped of protocol and www): 1415
44	Line count: 23794
45	Actual unique URL count: 10290
46	Unique basic URL count (stripped of protocol and www): 9656
47	******************************************************
48	wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
49	Counting all domains and urls in greyListed.txt
50	Count of unique domains: 288
51	Count of unique basic domains (stripped of protocol and www): 277
52	Line count: 4485
53	Actual unique URL count: 2751
54	Unique basic URL count (stripped of protocol and www): 2727
55	******************************************************
56	wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
57	Counting all domains and urls in keepURLs.txt
58	Count of unique domains: 1464
59	Count of unique basic domains (stripped of protocol and www): 1362
60	Line count: 47280
61	Actual unique URL count: 25683
62	Unique basic URL count (stripped of protocol and www): 20451
63	******************************************************
64
65
66	XXXXXXXXXX
67	wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
68	Counting all domains and urls in seedURLs.txt
69	Count of unique domains: 1462
70	Count of unique basic domains (stripped of protocol and www): 1360
71	Line count: 25679
72	Actual unique URL count: 25679
73	Unique basic URL count (stripped of protocol and www): 20447
74	******************************************************
75	XXXXXXXXXX
76
77	seedURLs is a subset of keepURLs.
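A subset claim like this one can be verified with set difference. The sketch below is a minimal illustration on hypothetical in-memory lines; on the real data the two lists would be read from seedURLs.txt and keepURLs.txt.

```python
def url_set(lines):
    """Strip whitespace and drop blank lines, returning the distinct URLs."""
    return {line.strip() for line in lines if line.strip()}

def missing_from(subset_lines, superset_lines):
    """Return URLs in subset_lines that do not occur in superset_lines."""
    return url_set(subset_lines) - url_set(superset_lines)

# Hypothetical stand-ins for the contents of keepURLs.txt and seedURLs.txt.
keep = ["http://example.org/a\n", "http://example.org/a\n", "http://example.org/b\n"]
seed = ["http://example.org/a\n"]

# An empty result means every seed URL also occurs in keep.
print(missing_from(seed, keep))
```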
78
79
80	2a. DISCARDED URLS:
81	URLS that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
82	23794
83
84	b. GREYLISTED URLS:
85	> wc -l greyListed.txt
86	4485
87
88
89	c. keepURLs (the URLs we kept for further processing):
90	wc -l keepURLs.txt
91	47280 keepURLs.txt
92
93
94	d. Of the keepURLs, 4 more webpages, from ultimately irrelevant sites, are listed in unprocessed-topsite-matches.txt.
95
96	3 are not in MRI but are of the same domain; one is just a gallery of holiday pictures.
97
98	> less unprocessed-topsite-matches.txt
99	The following domain with seedURLs are on a major/top 500 site
100	for which no allowed URL pattern regex has been specified.
101	Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
102	http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
103	http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
104	http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
105	https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl
106
107
108	e. After duplicates were further pruned out from what remained of keepURLs, the seedURLs for Nutch:
109
110	wc -l seedURLs.txt
111	25679 seedURLs.txt
112
113	wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
114	In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
115	Count of domains: 1462
116	Count of unique domains: 1360
117
118
119	But anglican.org had been wrongly greylisted, so it was added back in
120	-> 1463 domains.
121
122	3a. Num URLs prepared for crawling:
123	wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
124	25679 seedURLs.txt
125
126	b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
127
128
129	wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
130	1 1463 10241
131
132	(2nd number)
133	OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
134	1463
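The same directory count can be reproduced outside the shell. The sketch below is a rough Python equivalent of the find command above; the function name and the throwaway directory layout are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def count_subdirs(parent):
    """Count the immediate subdirectories of parent.

    Note: Path.is_dir() follows symlinks, whereas find -type d does not,
    so the two can differ if the folder contains symlinked directories.
    """
    return sum(1 for entry in Path(parent).iterdir() if entry.is_dir())

# Demonstration on a temporary folder with three site folders and one plain file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00001", "00002", "00003"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "seedURLs.txt").write_text("")
    demo_count = count_subdirs(tmp)

print(demo_count)
```

Like find, this also counts hidden directories, which echo */ would skip.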
135
136
137	/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463
138
139
140	[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
141	1462+1 (for the greylisted anglican.org) = 1463]
142
143	4. Num sites crawled:
144	wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
145	1447
146	wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
147	1 1447 10129
148
149	5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
150	wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
151	619
152
153
154	6. Number of sites in MongoDB:
155	1446
156
157	Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)
158
159	* 01067 is listed under sites crawled, but not ingested into mongodb.
160
161	The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
162	99, 88, 97, 99
163
164	and 64/64 sites in siteIDs 1400-1463.
165
166	=> 1+12+3+1 = 17 of the 1463 sites were not ingested (16 were not crawled; 01067 was crawled but not ingested), leaving 1446 sites in mongodb.
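The tally of missing sites can be rechecked by expanding the siteID ranges listed above. The helper name is hypothetical; the IDs and totals are the ones given in this file.

```python
from collections import Counter

def expand(spec):
    """Expand a site ID spec such as '00485-00495' or '00179' into a list of ints."""
    first, _, last = spec.partition("-")
    return list(range(int(first), int(last or first) + 1))

# Site IDs listed above as not present in MongoDB.
not_ingested_specs = ["00179", "00485-00495", "00499-00502", "01067"]
not_ingested = [site for spec in not_ingested_specs for site in expand(spec)]

# Group by hundreds (00100s, 00400s, 00500s, 01000s) to compare with 99, 88, 97, 99.
per_hundred = Counter(site // 100 for site in not_ingested)

total_sites = 1463
ingested = total_sites - len(not_ingested)
print(dict(per_hundred), len(not_ingested), ingested)
```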
167
168
169	7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
170	the number of web pages ingested into mongodb is less than 5 times that figure,
171	because only crawled web pages with non-empty text were ingested into mongodb.
172
173	Num pages in MongoDB:
174	db.getCollection('Webpages').find({}).count()
175	119874
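As a quick check of the ratio between ingested pages and seed URLs, using only the two counts recorded in this file:

```python
seed_urls = 25679          # non-duplicate seedURLs prepared for Nutch
pages_in_mongodb = 119874  # result of the Webpages count above

# Pages ingested into MongoDB per seed URL.
ratio = pages_in_mongodb / seed_urls
print(f"{ratio:.2f} pages ingested per seed URL")
```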
176
177	---------------------------
178
179	#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
180	wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
181	3276 9828 419259
182
183	#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
184	wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
185	1027 1027 15405
186	wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
187	1027 4108 35945
188
189	# number of dump.txt files
190	wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
191	1446 1446 24582
192	wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
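The fgrep -l count above could also be gathered in one pass. This is a minimal sketch, assuming one dump.txt per site folder as in the crawled directory; the function name and the temporary test layout are hypothetical.

```python
import tempfile
from pathlib import Path

def sites_with_marker(crawl_dir, marker=b"text:start"):
    """Return (number of dump.txt files containing marker, total dump.txt files)."""
    dumps = sorted(Path(crawl_dir).glob("*/dump.txt"))
    # Read as bytes, since dump.txt may contain non-text data.
    hits = sum(1 for dump in dumps if marker in dump.read_bytes())
    return hits, len(dumps)

# Demonstration on a temporary layout: two sites with text, one without.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"00001": b"text:start\nkia ora\n", "00002": b"", "00003": b"x text:start y"}
    for site, data in contents.items():
        (Path(tmp) / site).mkdir()
        (Path(tmp) / site / "dump.txt").write_bytes(data)
    demo = sites_with_marker(tmp)

# From the counts above: sites whose dump.txt has no text:start at all.
sites_without_text = 1446 - 1027
print(demo, sites_without_text)
```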
193
194
195	Check whether commoncrawl has a field for how much text there is on a page.
196	If not, this would be a useful feature for them to add.
197
198
